Contents

Preface

1 Statistics in medical research

1.1 Statistics at large

1.2 Statistics in medicine

1.3 Statistics in medical research

1.4 What does statistics cover?

1.5 The scope of this book

2 Types of data

2.1 Introduction

2.2 Categorical data

2.3 Numerical data

2.4 Other types of data

2.5 Censored data

2.6 Variability

2.7 Importance of the type of data

2.8 Dealing with numbers

3 Describing data

3.1 Introduction

3.2 Averages

3.3 Describing variability

3.4 Quantifying variability

3.5 Two variables

3.6 The effect of transforming the data

3.7 Data presentation

Exercises

4 Theoretical distributions

4.1 Introduction

4.2 Probability

4.3 Samples and populations

4.4 Probability distributions

4.5 The Normal distribution

4.6 The Lognormal distribution

4.7 The Binomial distribution

4.8 The Poisson distribution

4.9 Mathematical calculations

4.10 The Uniform distribution

4.11 Concluding remarks

Exercises

5 Designing research

5.1 Introduction

5.2 Categories of research design

5.3 Sources of variation

5.4 An experiment: is the blood pressure the same in both arms?

5.5 The design of experiments

5.6 The structure of an experiment

5.7 Random allocation

5.8 Minimization

5.9 Observational studies

5.10 The case-control study

5.11 The cohort study

5.12 The cross-sectional study

5.13 Studies of change over time

5.14 Choosing a study design

Exercises

6 Using a computer

6.1 Introduction

6.2 Advantages of using a computer

6.3 Disadvantages of using a computer

6.4 Types of statistical program

6.5 Evaluating a statistical package

6.6 Strategy for computer-aided analysis

6.7 Forms for data collection

6.8 Plotting

6.9 Other uses of computers

6.10 Misuses of the computer

6.11 Concluding remarks

7 Preparing to analyse data

7.1 Introduction

7.2 Data checking

7.3 Outliers

7.4 Missing data

7.5 Data screening

7.6 Why transform data?

7.7 Other features of the data

7.8 Concluding remarks

Exercises

8 Principles of statistical analysis

8.1 Introduction

8.2 Sampling distributions

8.3 A demonstration of the distribution of sample means

8.4 Estimation

8.5 Hypothesis testing

8.6 Non-parametric methods

8.7 Statistical modelling

8.8 Estimation or hypothesis testing?

8.9 Strategy for analysing data

8.10 Presentation of results

8.11 Summary

Exercises

9 Comparing groups - continuous data

9.1 Introduction

9.2 Choosing an appropriate method of analysis

9.3 The distribution

9.4 One group of observations

9.5 Two groups of paired observations

9.6 Two independent groups of observations

9.7 Analysis of skewed data

9.8 Three or more independent groups of observations

9.9 One way analysis of variance - mathematics and worked example

9.10 Presentation of results

9.11 Summary

Exercises

10 Comparing groups - categorical data

10.1 Introduction

10.2 One proportion

10.3 Proportions in two independent groups

10.4 Two paired proportions

10.5 Comparing several proportions

10.6 The analysis of frequency tables

10.7 frequency tables - comparison of two proportions

10.8 tables - comparison of several proportions

10.9 Large tables with ordered categories

10.10 tables - analysis of matched variables

10.11 Comparing risks

10.12 Presentation of results

10.13 Summary

Exercises

11 Relation between two continuous variables

11.1 Association, prediction and agreement

11.2 Correlation

11.3 Use and misuse of correlation

11.4 Rank correlation

11.5 Adjusting a correlation for another variable

11.6 Use of the correlation coefficient in assessing non-Normality

11.7 Correlation - mathematics and worked examples

11.8 Interpretation of correlation

11.9 Presentation of correlation

11.10 Regression

11.11 Use of regression

11.12 Extensions

11.13 Regression - mathematics and worked example

11.14 Interpretation of regression

11.15 Relation to other analyses

11.16 Presentation of regression

11.17 Regression or correlation?

Exercises

12 Relation between several variables

12.1 Introduction

12.2 Analysis of variance and multiple regression

12.3 Two way analysis of variance

12.4 Multiple regression

12.5 Logistic regression

12.6 Discriminant analysis

12.7 Other methods

Exercises

13 Analysis of survival times

13.1 Introduction

13.2 Survival probabilities

13.3 Comparing survival curves in two groups

13.4 Mathematical calculations and worked examples

13.5 Incorrect analyses

13.6 Modelling survival - the Cox regression model

13.7 Design of survival studies

13.8 Presentation of results

Exercises

14 Some common problems in medical research

14.1 Introduction

14.2 Method comparison studies

14.3 Inter-rater agreement

14.4 Diagnostic tests

14.5 Reference intervals

14.6 Serial measurements

14.7 Cyclic variation

Exercises

15 Clinical trials

15.1 Introduction

15.2 Design of clinical trials

15.3 Sample size

15.4 Analysis

15.5 Interpretation of results

15.6 Writing up and assessing clinical trials

Exercises

16 The medical literature

16.1 Introduction

16.2 The growth of statistics in medical research

16.3 Statistics in published papers

16.4 Reading a scientific paper

16.5 Writing a scientific paper

Exercises

Appendix A Mathematical notation

A1.1 Introduction

A1.2 Basic ideas

A1.3 Mathematical symbols

A1.4 Functions

A1.5 Glossary of notation

Appendix B Statistical tables

Answers to exercises

References

Index

Preface

The difficulties many intelligent people have with 'sums' are infinite. Greenwood (1948)

This book on statistics is primarily aimed at medical researchers. Whether clinical or non-clinical, most will have received some statistics teaching as undergraduates, but it will have been fairly brief, a long time ago, and largely forgotten by the time it is needed. The book should also be useful to medical students, to clinicians who wish to understand the principles of the design and analysis of research, and to those attending postgraduate courses in medical statistics.

I have been motivated to write this book by the belief that most introductory texts do not explain adequately the concepts that underlie the whole subject of statistics, and in many cases they are divorced from the reality of carrying out and assessing medical research. This book should provide an understanding of the basic principles that underlie research design, data analysis and the interpretation of results, and enable the reader to carry out a wide range of statistical analyses. The emphasis is firmly on practical aspects of the design and analysis of medical research and I have paid special attention to the interpretation and presentation of results. By discussing both the use and misuse of statistics the book should also give the reader the material to be able to judge the appropriateness of the methods and interpretation in papers published in medical journals.

I have assumed that most researchers now have access to a computer so that the mathematical details are generally confined to self-contained sections that may be omitted. I have used real data throughout, mostly from published papers, and I have tried to find data that are interesting in their own right. In most cases all the raw data are given, so that the analyses can be reproduced either by computer or by hand calculation. This feature will assist in the evaluation of a statistical computer program.

Many data sets have been taken from published papers, not always used for the authors' original purpose. Some data sets have been reconstructed from graphs or summary statistics and others have come from my own collaborative studies. I thank everyone whose data I have used. I have tried to represent these studies fairly, and apologise if I have failed at all in this respect. I am grateful to the British Medical Journal and the British Journal of Obstetrics and Gynaecology for permission to reproduce figures and Ciba-Geigy Ltd, the Biometrika Trustees and Oliver and Boyd for permission to reproduce statistical tables. Almost all of the figures were produced using STATA and STAGE (Computing Resource Center, Los Angeles).

I wish to thank everyone who has helped me to write this book. The whole book was read in draft by Martin Bland, Caroline Dore, Sheila Gore and Richard Wootton to whom I am especially grateful, and I also thank Peter Clark, Bianca De Stavola, David Hill and Patrick Royston for reading certain chapters. Their comments and suggestions have been enormously valuable, but I must take the blame for any remaining infelicities or errors. I especially thank Judy MacDonald for typing the manuscript, and dealing with numerous revisions; thanks too to Olive Waldron and Clare Wood who typed early drafts. Lastly I thank Sue for her encouragement and support.

Douglas Altman 1990

1 Statistics in medical research

At the thought of statistics, the Collector, walking through the chaotic Residency garden, felt his heart quicken with joy…. For what were statistics but the ordering of a chaotic universe? Statistics were the leg-irons to be clapped on the thugs of ignorance and superstition which strangled Truth in lonely byways.

J. G. Farrell, The Siege of Krishnapur

1.1 STATISTICS AT LARGE

We are bombarded with statistics to an unprecedented degree. Newspapers contain a wealth of statistical information, relating to trade and industry, finance, (un)employment, road accident figures and the like, and there are frequent results of opinion polls and surveys. Statistics presented in this way are of varying reliability. While political opinion polls are performed with reasonably reliable methods, most surveys are based on asking questions of some convenient group of people, with no concern for their representativeness. They may even be based on volunteered information, as in phone-in polls.

It is also common to see reports of medical research in the media. Research findings are usually based on sound methodology, but as the results may be presented in a like manner the distinction in reliability is not widely perceived. For example, newspapers will report in similar terms the findings of a poll about attitudes to consumption of eggs in the light of worries about salmonella and also the results of an epidemiological study investigating the relation between use of the contraceptive pill and risk of breast cancer. Many medical issues are really too complex to be dealt with adequately in a short item in a newspaper or on television. Topics such as the possibility of raised rates of childhood leukaemia around nuclear power stations or the carcinogenic effect of adding fluoride to drinking water require an in-depth consideration of many complicated issues. The complexity of the fluoride debate may be judged by the fact that a court case lasted 201 days, with much of the evidence being statistical (Oldham, 1985).

The word 'research' has powerful connotations, with reliability being implicit. Few people outside the relevant field are concerned about how the research was done, only about what was found. One recent advertisement I have seen makes use of this weakness. The company supports its promotion of desk top binding systems with the opening comment that 'Research shows that a well presented document stands a 95% better chance of being properly read and well received'. I doubt whether any such research had been carried out, or even if it could be, but the power of research is successfully invoked.

At the other extreme from this piece of nonsense is the following excerpt from a newspaper report (Guardian, 23 August 1986) of a paper in a medical journal:

Score system for heart risk

A cheap 'ready reckoner' for identifying men at high risk of a heart attack has been devised by doctors it was announced yesterday. Expensive electrocardiograph tests and measurements of blood cholesterol levels can be supplanted by a simple scoring system. The system can identify more than half of the men likely to have a heart attack over the next five years, who can then be advised to adopt a healthier lifestyle or offered treatment. … The system requires measurement of blood pressure, an estimate of the number of years of cigarette smoking, knowledge of previous angina, heart attack or diabetes, and whether either parent died of heart trouble.

Clearly this study is potentially valuable to thousands of men. Are these results reliable and how was the 'ready reckoner' devised? Of course we would not expect to obtain this information from a short newspaper article, but the fact that no information is given about how the study was performed may put it in no better light than any other statistics reported in the same newspaper.

Another example is given by a newspaper article (Guardian, 19 May 1988) reporting a study of the relation between longevity and left-handedness:

Not many old hands left, says scientist

If you're over 80 and left-handed, you're in a class of your own. Nearly all the other left-handers have passed on… It was, said Dr Halpern last night, a small sample. Generally the scientists found that there was no difference in death rates up till the age of 33; from then the left-handed slowly fade away.

She offered several reasons. One might be that low-weight babies tended to be left-handed, and low birthweight might mean reduced chances of survival. The other was that it was a right-handed world. The left-handed were simply at a disadvantage with automobiles and power tools, suffered from greater stress and had more accidents.

I shall explain in Chapter 5 why the interpretation of the results from this study is not valid. For the moment I shall just note that the study findings are reported uncritically, and the last paragraph contains unsupported speculations which do not even appear in the publication (Halpern and Coren, 1988). There is no way that readers of the newspaper could distinguish the reliability of the information in the two newspaper reports. Yet it is variation in scientific standards, and hence the validity of research findings, that fuels controversies in medical research. These often impinge on daily life, such that the public becomes confused about possible adverse health effects of numerous foods and drugs (Feinstein, 1988).

In general the emphasis is on results (which are presented as facts), with little or no regard to the manner in which they were obtained, which is probably why the subject of statistics is widely seen as relating solely to the analysis of data and the presentation of numerical results. While these are important parts of statistics, there is much else besides. In particular, how and why the data were collected are supremely important.

There is a wide perception of statistics (and perhaps statisticians too) as untrustworthy, as embodied in the idea that 'you can prove anything with statistics'. This saying, if it means anything, suggests that figures can be presented in a variety of ways, and that it is common for the most favourable view to be selected. While there is justification for this belief, it is not true that you can prove anything with statistics; the opposite is true, at least with regard to statistical methods used in research. Statistical analysis allows us to put limits on our uncertainty, but not to prove anything. Despite considerable mistrust of statistics, there is a tendency towards uncritical acceptance by the public of research findings, which may be attributed to the power of the printed word.

1.2 STATISTICS IN MEDICINE

Statistics are increasingly prevalent in medical practice. Nowadays much concern is devoted to hospital utility statistics, audit, resource allocation, vaccination uptake, numbers of new cases of AIDS, and so on. Journals and magazines for doctors are full of statistical material of this sort, as well as the findings of individual research studies. Statistical issues are implicit in all clinical practice when making diagnoses and choosing an appropriate treatment.

It is increasingly common to see statistical results from research papers quoted in promotional materials for drugs (especially) and other medical therapies. As an example, the following text is from an advertisement for a treatment for leukaemia appearing in a clinical oncology journal in 1989 (I have only changed the names of the drugs):

Significantly more first-course responders with NOVORAN

  • 63% of all adults with ANLL treated with NOVORAN had a complete remission, compared with 53% of all patients treated with orsoran

  • 56% of patients had a complete remission after one induction course with NOVORAN, compared with 36% of patients treated with orsoran

  • 89% of complete responders to NOVORAN responded after a single induction course, compared with only 68% of complete responders to orsoran.

  • Single df test

To understand this passage it is necessary to know the meaning of expressions like , and perhaps also the curious footnote. More importantly, however, we should wish to know how large the study was and what the design was. An appreciation of the methods by which these percentages were obtained and how to interpret them is thus at least useful and arguably essential for all those who treat patients. (We cannot obtain the information in this case as the results of this study were reported as being 'on file', i.e. unpublished.)
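The importance of knowing the size of the study can be illustrated with a small calculation. The sketch below is not taken from the advertisement or from the unpublished study. It assumes, purely for illustration, equal and entirely hypothetical group sizes, and applies an uncorrected chi-squared test with one degree of freedom to the quoted remission rates of 63% and 53%. The function name and the group sizes are assumptions.

```python
import math


def chi_squared_2x2(successes_a, n_a, successes_b, n_b):
    """Pearson chi-squared statistic (1 df, no continuity correction) and
    P value for comparing two independent proportions."""
    failures_a = n_a - successes_a
    failures_b = n_b - successes_b
    total = n_a + n_b
    total_successes = successes_a + successes_b
    total_failures = failures_a + failures_b

    observed = [successes_a, failures_a, successes_b, failures_b]
    expected = [
        n_a * total_successes / total,
        n_a * total_failures / total,
        n_b * total_successes / total,
        n_b * total_failures / total,
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 df the upper tail area is P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical group sizes: the advertisement does not report them.
for n_per_group in (100, 400, 1000):
    remissions_a = n_per_group * 63 // 100  # 63% complete remission
    remissions_b = n_per_group * 53 // 100  # 53% complete remission
    stat, p = chi_squared_2x2(remissions_a, n_per_group, remissions_b, n_per_group)
    print(f"n per group = {n_per_group}: chi-squared = {stat:.2f}, P = {p:.4f}")
```

Under these assumptions the same pair of percentages gives a P value above 0.05 with 100 patients per group and below 0.05 with 400 per group, which is why the percentages cannot be interpreted without the sample size.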

For those doing research statistical issues are fundamental, and so it is extremely important to understand basic statistical ideas relating to research design and data analysis, and to be familiar with the most common methods of statistical analysis.

1.3 STATISTICS IN MEDICAL RESEARCH

Colton (1974, p.1) observed that 'statistics pervades the medical literature'. Since then the huge influx of statistics into medical research has continued. The aim is to improve the reliability and credibility of the findings from medical research, but there is no guarantee that the statistical aspects have been handled well or even validly. As I shall show in the final chapter, there is considerable evidence that many published papers contain statistical errors.

There are many reasons why errors in statistics are a matter for concern. Put most simply, if there are statistical errors the conclusions of the study may be incorrect. Readers of the paper may not detect the error and may be misled either with respect to clinical practice or further research. While this argument may overestimate the influence of a single published paper, there is much evidence that readers of medical journals accept uncritically the printed word, as does the general public.

There is also a similar belief that statistics is about data analysis, perhaps because this is the most visible part of the statistical contribution. Data analysis is certainly an important part of statistics, but this narrow view excludes in particular vital aspects relating to the design of research. Without the solid foundations of a good design the edifice of analysis is unsafe. Reliable results depend upon an appropriate research design: 'The justification for the analysis lies not in the data collected but in the manner in which the data were collected' (Schoolman et al., 1968). Many controversies in medicine are traceable to varying quality of the design of the research.

1.4 WHAT DOES STATISTICS COVER?

Figure 1.1 shows the general sequence of steps in a research project. Statistical thinking can contribute to every stage, although the major steps of design, analysis and interpretation will be the prime focus of this book.

The key difference between medical research and clinical practice is their scope. In each, data are collected from individual subjects, but in medical research the aim is to be able to make some general statements about a wider set of subjects, and we are not usually especially interested in the particular subjects that have been studied. We thus use information from a sample of individuals to make some inference about the wider population of like individuals. The three words sample, inference and population are formal statistical terms that will be explained fully in later chapters. The important point here is that the subjects who are studied act as a proxy for the total group of interest.


Figure 1.1 General sequence of steps in a research project.

1.4.1 Research design

We can never study all diabetics, all pregnant women, or all people living in a geographical area. If we wish to investigate, for example, the relation between maternal weight gain in pregnancy and baby's birth weight we must study a sample of pregnant women. The aim of this research would be to extrapolate the findings from this sample to all pregnancies. For this inference to be reasonable, it is necessary for the sample of women to be representative of all pregnant women. In theory we can obtain a truly representative sample only by choosing women at random (a concept explained in Chapter 5) but even then the sample would be specific to a time period and geographical area. In practice, samples are nearly always chosen systematically and the subjects' characteristics are described so that their representativeness can be judged. The study just proposed would probably be carried out by taking all women registering at one or more specific hospitals in a set time period. In most studies it is necessary to exclude some people. Here women registering late in pregnancy would have to be excluded because they would not provide sufficient weight data. It is well known that this group is untypical in many respects. We might also wish to exclude premature births (<37 weeks) and there would probably be some other minor reasons for exclusion, such as diabetes and twins.

It is customary for the report of such a study to list the criteria for including or excluding subjects in the study, and to describe important characteristics of the sample at the start of the study; in this case these would include age, parity (number of previous children), height and weight. It is then a subjective matter to decide whether or not it is reasonable to take the findings from the study sample as being representative of all pregnant women.
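The exclusions described for the proposed study of maternal weight gain might be recorded along the following lines. This is only a sketch: the record fields, the 20 week cut-off for 'late' registration and the function names are hypothetical, and only the 37 week definition of a premature birth comes from the text. Counting the reasons for exclusion makes it easier to report them alongside the characteristics of the women who remain.

```python
from dataclasses import dataclass


@dataclass
class Pregnancy:
    registration_week: int  # gestational week at hospital registration
    delivery_week: int      # gestational week at delivery
    diabetic: bool
    twins: bool


LATEST_REGISTRATION_WEEK = 20  # hypothetical cut-off for "registering late"
TERM_WEEKS = 37                # births before 37 weeks count as premature


def exclusion_reason(pregnancy):
    """Return the first applicable reason for exclusion, or None if eligible."""
    if pregnancy.registration_week > LATEST_REGISTRATION_WEEK:
        return "late registration"
    if pregnancy.delivery_week < TERM_WEEKS:
        return "premature birth"
    if pregnancy.diabetic:
        return "diabetes"
    if pregnancy.twins:
        return "twins"
    return None


def split_sample(records):
    """Separate eligible records from excluded ones, counting each reason."""
    included = []
    excluded_counts = {}
    for record in records:
        reason = exclusion_reason(record)
        if reason is None:
            included.append(record)
        else:
            excluded_counts[reason] = excluded_counts.get(reason, 0) + 1
    return included, excluded_counts


# Invented records, for illustration only.
records = [
    Pregnancy(12, 40, False, False),
    Pregnancy(30, 39, False, False),
    Pregnancy(10, 35, False, False),
    Pregnancy(14, 38, True, False),
]
included, excluded_counts = split_sample(records)
print(len(included), "included;", excluded_counts)
```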

A comparative study would involve the same considerations as the observational study just described. For example, we might wish to compare groups of women given different dietary advice. Here we have the additional issue of how to decide which women get which advice. We would like a method that would result in the women in the two groups being of similar age, parity and pre-pregnancy weight. Further, we want a method that excludes the possibility of subjective influence on who receives which advice.
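One way of removing subjective influence is to let chance decide which advice each woman receives. The following is a minimal sketch of simple random allocation, assuming hypothetical subject identifiers and group labels. It is not the specific procedure used in any study mentioned here; random allocation is treated properly in Chapter 5.

```python
import random


def allocate_at_random(subject_ids, groups=("advice A", "advice B"), seed=None):
    """Assign each subject to one group by an independent random draw.

    The group labels and the optional seed are illustrative assumptions.
    A seed makes the allocation list reproducible for checking.
    """
    rng = random.Random(seed)
    return {subject: rng.choice(groups) for subject in subject_ids}


# Hypothetical subject numbers 1 to 20.
allocation = allocate_at_random(range(1, 21), seed=1990)
for group in ("advice A", "advice B"):
    members = [s for s, g in allocation.items() if g == group]
    print(group, len(members), "women")
```

Simple randomisation of this kind does not guarantee groups of equal size, and it makes the groups similar in age, parity and pre-pregnancy weight only on average, so small studies can still be unbalanced by chance.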

All the issues just described come under the broad heading of design, and are thus part of the statistical contribution to research. Another aspect is determination of a suitable sample size for the study. I hope that this example has illustrated some of the reasons why a correct design is an essential part of good research, and thus the importance of good statistical input at this early stage. Different problems arise in each study, but there are many general principles for good design, which are discussed in Chapter 5. Clinical trials are considered in detail in Chapter 15.

A consequence of the fundamental role of study design is that the most important part of a research paper is the Methods section. It is here that we learn what was done and if the results will be useful. A study of maternal weight gain carried out only on women of above average weight or restricted to pregnancies ending with low birth weight babies might be of no interest, regardless of the findings, and a study carried out in Britain may be of little relevance to the situation in Africa or Asia. Put more generally, we cannot make valid generalizations from unrepresentative or biased samples. The avoidance of bias is one of the main aims of sound research design.

The report of the aforementioned study of men at high risk of heart attacks was published in the British Medical Journal (Shaper et al., 1986). The 'Subjects and Methods' section of their paper (slightly shortened here) described exactly how the study was carried out:

The data used were derived from the British Regional Heart Study, which examined 7735 men aged 40-59 randomly selected from the age-sex registers of representative group general practices in 24 towns in England, Wales and Scotland. The 24 towns were selected from those with populations of 50 000-100 000; they represented the full range of cardiovascular disease mortality and included towns in all the major standard regions. The general practice selected in each town had a social class distribution representative of the town. The men were selected at random from age-sex registers; no attempt was made to exclude subjects with cardiovascular disease, and there was a 78% response rate.

Research nurses administered a questionnaire to and completed an examination of each man. In this study exposure to cigarette smoking was expressed as the number of years a man had smoked, irrespective of the quantity, as this was most strongly related to risk of ischaemic heart disease. Subjects were regarded as having angina if they indicated on the questionnaire that chest pain was present on exertion (walking uphill or hurrying). This included definite and possible angina. Results in this paper are confined to the 7506 men (97%) with complete data on all the above risk factors.

Only when armed with this information and details of the methods of analysis can we make a proper assessment of the appropriateness of the authors' conclusions to all men aged 40-59. Extrapolation outside this age range is unwise.

If, however, important information is omitted from a paper, then we must reserve judgement on the findings. I consider this and other issues regarding reading medical papers in the final chapter.

1.4.2 Analysis and interpretation

Despite the preceding comments, the analysis of data is the major part of learning about statistics. There are dozens of different methods of analysis, which makes it difficult to choose the correct method for a particular case. Before worrying about particular methods, however, it is necessary to consider the philosophy that underlies all methods of analysis. We will see that statistical methods of analysis are based on the same key idea: that we use data from a sample to draw inferences about a wider population. Of course particular methods are important, but the general principles need to be absorbed first. The main general approaches to the statistical analysis of data are considered in Chapter 8, before particular methods are introduced.

The interpretation of results of statistical analysis is not always straightforward, but is simpler when the study has a clear aim and when there is an appreciation of the general principles that underlie the analysis. Indeed, if the study has been well designed and correctly analysed the interpretation of results can be fairly simple.

1.5 THE SCOPE OF THIS BOOK

In this book I have tried to give prominence to the concepts and principles of statistical design and analysis before considering specific methods of analysing data. Thus it is not until Chapter 9 that I start to describe the more familiar methods of analysis. The earlier chapters cover basic material including, as well as the main ideas of design and analysis, consideration of different types of data that may be encountered and advice on how to use a computer for statistical analysis. Chapters 9 to 12 describe the main methods of statistical analysis, while Chapters 13 to 15 consider specific medical topics. In Chapter 16 I look at the way statistics is used in the medical literature, and give advice on reading and writing medical papers with respect to the statistical content.

Medical research falls into the broad areas of clinical research, laboratory research and epidemiology, which may be regarded as relating to people, samples from people or populations of people. In each case the individuals studied may be a mixture of healthy and ill people. I use the term 'clinical' to include research in surgery, dentistry, nursing, psychology and so on.

The statistical methods described in this book apply to all of these areas, although the specific problems may vary. Epidemiology, however, has many special features and statistical methods, which are covered comprehensively in specialized texts.

One problem when writing a statistics textbook is the likely variation among readers in their familiarity with mathematical methods. I have adopted two devices to assist those who are less than comfortable with mathematics. Firstly, I have included an Appendix on mathematical notation (Appendix A), which includes brief explanations of all the terms used. Secondly, in most chapters I have put the mathematical formulae in self-contained sections, so that it is possible to read about a particular method without being confused or distracted by sometimes formidable looking equations. Although the formulae are not needed when using a computer, they do show the way in which the analysis works, and except in cases of extreme hypersensitivity I recommend that they should be examined. For the more advanced methods in later chapters I have not included the mathematical formulae as these are very complicated and the analyses are always done on a computer.

It is not necessary to be able to remember formulae - these can be looked up. What is important is to understand the general principles of the research process, from formulating an objective through all the steps shown in Figure 1.1, and to be aware of the limitations of what may or may not be deduced.

I do not pretend that statistics is easy to learn. On the contrary, I think it is rather difficult. Statistics is a curious amalgam of mathematics, logic and judgement. Although many are put off by the mathematics, it is often the logical processes that cause more difficulty - the principles of good design, and the concepts underlying data analysis and interpretation. If statistics were what many people expect, namely an extension of simple mathematics, it would be more straightforward. The mismatch between expectation and reality leads to many problems, a dislike of the subject, frustration and maybe even tears. In the past it has led to remarks such as the following: 'The truth of the matter is that most of us detest statistical analysis and welcome any excuse to dispense with it' (Seddon, 1937). Fortunately this is not an inevitable pathway. I hope that the approach that I have adopted in this book leads to a relatively painless acquisition of an understanding of statistics.

2 Types of data

2.1 INTRODUCTION

There is a lot more to statistics than the analysis of data, and in later chapters I shall consider aspects such as the design of good experiments and the interpretation of results. Nevertheless statistics as a subject is very largely about data, so it is sensible to start with a brief discussion of various types of data that may be encountered in medical work. The nature of the observations is of major importance in relation to the choice of correct statistical methods of analysis.

Data can be either categorical or numerical (otherwise known as qualitative and quantitative), but within these broad classifications there are various different types of data.

2.2 CATEGORICAL DATA

2.2.1 Two categories

The simplest type of observation on an individual is the allocation of that individual to one of only two possible categories. Often these relate to the presence or absence of some attribute. Examples of such categorizations for patients include:

  1. male/female
  2. pregnant/not pregnant
  3. married/single
  4. diabetic/non-diabetic
  5. smoker/non-smoker
  6. hypertensive/normotensive

Such data have numerous other names such as binary data, dichotomous data, attribute data, yes/no data, and 0-1 data. We will see later that there are some advantages in giving the numerical values 0 and 1 to the two categories.

Notice that whereas (1) and (2) above definitely split subjects into two groups the other examples are all simplifications of more complex data.

For example, without further information it is not clear how to categorize people who have been divorced in (3) or ex-smokers in (5). The classification of patients as hypertensive or not (6) imposes a cut-off point on values of a measurement (here blood pressure). In general this is an undesirable practice, and not only from the statistical viewpoint.

2.2.2 More than two categories

Clearly many classifications require more than two categories, such as country of birth or blood group. Examples (3) and (4) in the previous section might be expanded into several categories as follows:

married/single/divorced/separated/widowed

juvenile-onset diabetes/maturity-onset diabetes/non-diabetic

Another example is blood group: A/B/AB/O. Data of this type are also called nominal data.

In the above examples there is no obvious ordering of the categories, but often there is a natural order, as with the various staging systems for cancers (and other diseases) and social class. Returning to the example of cigarette consumption, it is common to classify subjects as

non-smokers/ex-smokers/light smokers/heavy smokers

where the degree of smoking could be subdivided further. Data of this type are also called ordinal data.

Another type of ordered categorical data arises with subjective assessment of something that cannot be measured. For example, a patient may classify their degree of pain as

minimal/moderate/severe/unbearable

(but see section 2.4.5).

Ordinal data are often reduced to two categories to simplify analysis and presentation, which may result in a considerable loss of information.

2.3 NUMERICAL DATA

2.3.1 Discrete data

Discrete numerical data arise when the observations in question can only take certain numerical values. Virtually all examples are counts of events, such as number of children, number of visits to the GP in a year, number of ectopic heart beats in 24 hours, etc.

The difference between such data as these and the ordered categorical data described earlier can be seen by considering an example of each:

Ordered categorical:

Stage of breast cancer: I II III IV

Discrete numerical:

Number of children: 0 1 2 3 4 5+

We cannot say that stage IV is twice as bad as stage II nor that the difference between stages I and II is equivalent to that between stages III and IV. In contrast, three children are three times as many as one (although not necessarily three times as bad!), and a difference of one means the same throughout the range of values.

In practice discrete data are often treated in statistical analyses as if they were ordered categories. This is not wrong, but it may not be getting the most out of the data. Conversely, where ordered categories are numbered, as with stage of disease or social class, the temptation to treat these numbers as statistically meaningful must be resisted. For example, it is not sensible to calculate the average social class or stage of cancer. The only information the numbers contain is in the ordering, which would be conveyed equally by calling them A, B, C, D and so on.

2.3.2 Continuous data

Continuous data are usually obtained by some form of measurement. Common examples include height, weight, age, body temperature, blood pressure and serum cholesterol. Such observations are not restricted to certain values except insofar as this is restricted by the accuracy of the measuring instrument.

It will not be necessary to record the data to numerous decimal places, but the fact that in principle it could be done is the distinguishing property of continuous measurements. Thus blood pressure is often recorded to the nearest 2 or perhaps 5 mmHg, and body weight of adults to the nearest 100 g.

Sometimes it is reasonable to treat discrete data as if they were continuous, at least as far as statistical analysis goes. While age is a continuous measurement, age at last birthday is discrete. In studies of adults with ages ranging from, say, 16 to 80, no harm is done in considering age in years as a continuous measurement (and this is standard practice), but for studies of pre-school children it would be better to use age in months. Heart rate (in beats per minute) is another discrete measurement that is usually regarded as continuous. Although the essential requirement for this change of status is that there should be a large number of different possible values, in practice we do not worry too much about analysing discrete measurements as if they were continuous.

Conversely, continuous data are often reduced to several categories.

Where the variable is known to be imprecise, such as reported number of cigarettes smoked per day, it may be sensible to record it in a few broad categories.

Otherwise, it is best to record the actual value of blood pressure, haemoglobin, etc. It is easy to convert to categories in the analysis, but the raw data cannot be retrieved later if only categories are recorded. Information is lost with no compensatory gain. Indeed, the statistical analysis of continuous data is more powerful, and often simpler.

When some calculation is necessary to derive the observation of interest this should be done by the computer. Thus it is much better to record date of birth and date of examination for subsequent calculation of age rather than to rely on mental arithmetic.
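As a minimal sketch of what "done by the computer" might look like, the Python functions below derive age at last birthday, in years or in completed months, from two recorded dates. The function names and the example dates are illustrative assumptions, not part of the original text.

```python
from datetime import date


def age_in_years(date_of_birth: date, date_of_exam: date) -> int:
    """Age at last birthday on the examination date."""
    if date_of_exam < date_of_birth:
        raise ValueError("examination date precedes date of birth")
    years = date_of_exam.year - date_of_birth.year
    # Subtract one year if the birthday has not yet been reached in the exam year.
    if (date_of_exam.month, date_of_exam.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def age_in_months(date_of_birth: date, date_of_exam: date) -> int:
    """Completed months of age, a finer scale for pre-school children."""
    if date_of_exam < date_of_birth:
        raise ValueError("examination date precedes date of birth")
    months = (date_of_exam.year - date_of_birth.year) * 12
    months += date_of_exam.month - date_of_birth.month
    # Subtract one month if the day of the month has not yet been reached.
    if date_of_exam.day < date_of_birth.day:
        months -= 1
    return months


# Hypothetical subject: examined one day before a 40th birthday.
print(age_in_years(date(1950, 6, 15), date(1990, 6, 14)))
```

Storing the two dates rather than a hand-calculated age also means the calculation can be checked or repeated later.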

The degree of measurement accuracy and the type of data are both important in relation to carrying out a proper statistical analysis.

2.4 OTHER TYPES OF DATA

The preceding sections have covered the most common types of data likely to be encountered in medical research. In this section some miscellaneous other types of data are described.

2.4.1 Ranks

Occasionally the data in question are the relative positions of the members of a group in some respect. The most obvious (although non-medical) example is in sporting competitions or examinations. Sometimes there is a clear underlying measurement, such as time to run 400 metres, but in other cases there is not, for example when expressing preferences between different treatments.

Patients are sometimes given two or more treatments and asked to express a preference. Such rankings are rare in medical work, but the idea is important. As we shall see in later chapters, in some circumstances it is a good idea to convert the measurements on a group of individuals into a rank ordering before analysing the data.
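Converting measurements into a rank ordering can be sketched in a few lines of Python. The function below is an illustration only: the name `rank_values` and the example values are hypothetical, and giving tied values the average of the ranks they span is an assumption made here, since the text does not say how ties should be handled.

```python
def rank_values(values):
    """Return the rank (1 = smallest) of each value, in the original order.

    Tied values share the average of the ranks they occupy
    (an assumed convention, not one stated in the text).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # sorted positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


# Hypothetical measurements on four individuals.
print(rank_values([58.2, 61.0, 55.4, 61.0]))
```

Only the ordering of the measurements survives this conversion; the sizes of the gaps between them are discarded.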

2.4.2 Percentages

Percentages arise when one takes the ratio of two quantities. Examples are the left ventricular ejection fraction, which measures the percentage of blood ejected from the left ventricle when the heart beats, and the relative body weight (observed body weight divided by 'desirable' body weight). In the first example the ratio is of two quantities both of which have been measured, while in the second a single measurement is divided by a pre-existing (constant) value usually taken from published tables.

Although it is reasonable to use these calculated percentages for well-established measurements, it is in general desirable to retain the information regarding both quantities used in the calculation. It would not, for example, be a good idea to record for each individual just the percentage reduction in blood pressure achieved following treatment. There is no particular reason to consider the effectiveness of a drug in terms of percentage reduction.

Although percentages may usually be regarded as continuous measurements they can cause problems in analysis, especially where there can be values either side of 100% (e.g. relative weight), or where there can be negative values, as when calculating the percentage change in some measurement. Percentage changes are also not symmetric: if your systolic blood pressure rises by some percentage, a subsequent fall of the same percentage takes it to a value below where it started, because the fall is calculated on the higher figure. Considerable care is necessary when considering such data.
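The asymmetry of percentage changes is easy to check with a little arithmetic. The Python sketch below uses a starting pressure of 100 mmHg and a change of 20%; both figures are illustrative choices made here rather than values taken from the text, and `apply_percent_change` is a hypothetical helper.

```python
def apply_percent_change(value, percent):
    """Return `value` increased (or, for negative `percent`, decreased) by `percent` per cent."""
    return value * (100 + percent) / 100


start = 100.0                                       # illustrative systolic pressure, mmHg
after_rise = apply_percent_change(start, 20)        # rise of 20% of 100
after_fall = apply_percent_change(after_rise, -20)  # fall of 20% of the higher value

print(start, after_rise, after_fall)
```

The fall is 20% of the raised value, which is a larger amount than the original rise, so the final figure ends up below the starting one.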

2.4.3 Rates and ratios

A similar approach is used to convert an observed frequency to a rate. For example, the number of perinatal deaths is usually related to the total number of births by calculating the perinatal mortality rate per 1000 births.

Sometimes the frequency of events of a specific kind is compared with the expected number of events. For example, the expected number of new cases of leukaemia in an area in a given time period can be calculated by applying national age- and sex-specific rates to the numbers of people in the area in each age-sex group. The ratio of the observed ($O$) to expected ($E$) frequencies yields the standardized mortality ratio as $O/E$, conventionally multiplied by 100.
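The calculation just described can be sketched in Python as follows. All the rates and population counts are made-up numbers for illustration, the function names are hypothetical, and scaling the ratio by 100 is a common convention assumed here rather than something the surviving text confirms.

```python
def expected_cases(reference_rates, local_population):
    """Expected number of events: the sum over age-sex groups of rate x people."""
    return sum(reference_rates[group] * count
               for group, count in local_population.items())


def standardized_ratio(observed, expected, scale=100.0):
    """Observed/expected ratio, multiplied by `scale` (100 by assumption)."""
    if expected <= 0:
        raise ValueError("expected number of events must be positive")
    return scale * observed / expected


# Hypothetical national rates per person over the study period, by (sex, age band).
national_rates = {("M", "0-39"): 0.0001, ("M", "40+"): 0.0004,
                  ("F", "0-39"): 0.0001, ("F", "40+"): 0.0003}
# Hypothetical numbers of people living in the area in each group.
area_population = {("M", "0-39"): 10000, ("M", "40+"): 5000,
                   ("F", "0-39"): 10000, ("F", "40+"): 10000}

expected = expected_cases(national_rates, area_population)
print(expected, standardized_ratio(observed=9, expected=expected))
```

On this scale a value above 100 means more events were observed than the national rates would predict for a population with the area's age and sex structure.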

2.4.4 Scores

When it is not possible to take direct measurements it is often possible to grade individuals in some way. In its simplest form, such a system may involve classifying a skin rash, for example, as mild, moderate or severe. More generally clinicians often use systems of graded symbols such as +, ++, +++. Although the meaning of such symbols is pretty obvious, the classes are usually undefined and will not be comparable from one doctor to another. Clearly, such simple scales are further examples of ordered categorical data.

Often, however, it is possible to classify patients in several ways, perhaps in relation to various symptoms and signs. For each symptom the different codings can be given numerical values and the various values added up to give a total score. This score is then the observation.

Table 2.1 Apgar system of scoring newborn babies

| Sign | Score 0 | Score 1 | Score 2 |
|---|---|---|---|
| Heart rate | Absent | Slow (< 100) | > 100 |
| Respiratory effort | Absent | Weak cry; hypoventilation | Good strong cry |
| Muscle tone | Limp | Some flexion of extremities | Well flexed |
| Reflex irritability (response to skin stimulation to feet) | No response | Some motion | Cry |
| Colour | Blue; pale | Body pink; extremities blue | Completely pink |

A well-known example is the Apgar score for evaluating the well-being of newborn babies (Apgar, 1953). Table 2.1 (from Apgar et al., 1958) shows how the 'Apgar score' is obtained. Infants are classified into one of three categories scored 0, 1 or 2 for each of five variables, and thus receive a total score of between 0 and 10. It is standard practice to calculate Apgar scores in all newborn babies at both one and five minutes after birth. At one minute a score of 7 or more is good, whereas a score of less than 3 is very bad.
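The arithmetic of such a composite score is simple enough to sketch in Python. The dictionary keys and the function name below are illustrative assumptions; only the five signs and the 0, 1 or 2 scoring come from Table 2.1.

```python
APGAR_SIGNS = ("heart_rate", "respiratory_effort", "muscle_tone",
               "reflex_irritability", "colour")


def apgar_total(sign_scores):
    """Sum the five sign scores (each 0, 1 or 2) into a total between 0 and 10."""
    missing = [sign for sign in APGAR_SIGNS if sign not in sign_scores]
    if missing:
        raise ValueError(f"missing signs: {missing}")
    for sign in APGAR_SIGNS:
        if sign_scores[sign] not in (0, 1, 2):
            raise ValueError(f"{sign} must be scored 0, 1 or 2")
    # Each sign carries the same weight in the total.
    return sum(sign_scores[sign] for sign in APGAR_SIGNS)


# Hypothetical assessment of one baby at one minute after birth.
one_minute = {"heart_rate": 2, "respiratory_effort": 1, "muscle_tone": 1,
              "reflex_irritability": 2, "colour": 1}
print(apgar_total(one_minute))
```

The plain sum makes the equal weighting of the five signs explicit; a different weighting would need a different formula.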

This is not the place to discuss the usefulness or validity of this particular scoring system, but three aspects of the system, which is typical of such schemes, should be noted. Firstly, for most of the signs some subjectivity is involved. Secondly, the numerical coding implies that any difference from 0 to 1 or from 1 to 2 is equally important. Thirdly, the five signs are considered equally important. Composite scores thus incorporate considerable subjectivity, some inherent in the combination procedure and some in the assessment of individuals.

In a non-medical field there has been considerable controversy over the relative weights given to the different events in ice-skating championships, and the scoring system for the decathlon is being changed because advances in achievement in some events have tended to undervalue other events. The same problem occurs in combining marks from different exams. The weighting of constituent elements of a composite score does not have to be equal, although it usually is in clinical practice.

2.4.5 Visual analogue scales

Patients may be asked to assess their degree of something unmeasurable like pain, mobility or hunger. A technique for improving on ordered categories (illustrated in section 2.2.2) is the visual analogue scale (VAS) or linear analogue scale. The patient is shown a straight line (often 10 cm long), the ends of which are labelled with extreme states. They are asked to mark the point on the line which represents their perception of their current state. A VAS for post-operative pain might look like

no pain |----------------------------------------| unbearable pain (with the patient's mark placed at some point along the line)

where the patient's mark indicates the place on the scale where the patient judges himself to be. As such assessments are clearly highly subjective, these scales are of most value when looking at changes within individuals. We cannot put any absolute meaning on a score of, say, 2.2 (measured from the leftmost end of the scale), but a reduction to 1.4 in the same patient is interpretable. Caution is required in handling such data. We might, for example, prefer a method of analysis that is based on the rank ordering of scores rather than their exact values.

2.5 CENSORED DATA

An observation is called censored if we cannot measure it precisely but know that it is beyond some limit. Common situations often producing censored data are:

  1. When measuring some trace constituent of blood the actual level may be below the lowest level that the machine or test can detect, even though it is known that the value should be greater than zero. Such values are termed non-detectable but are said to be censored at the limit of detectability. Because the convention is to plot data with low values to the left of a horizontal scale, this is also known as left censoring.

  2. In some experiments, often with animals, there is a fixed length follow-up period. During this period the investigators may be looking for the appearance or perhaps disappearance of some specific condition, where the observation of interest is the time taken from the start of the experiment. Where nothing has happened by the end of the experiment, those observations are (right) censored at that time. Likewise, in long-term clinical trials the outcome of interest is often length of survival. Here the trial will usually stop at a fixed time after recruitment to the trial began, so that patients still alive at the end of the trial all have censored survival times, with the censoring being after different times of observation depending on how long the patient had been in the trial.
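One way to picture censored observations is as a value paired with a flag saying whether the true value lies beyond it. The Python sketch below is an illustration under assumed names and units (times in months from the start of recruitment, a hypothetical detection limit); it is not a method of analysis, only a way of recording the two situations in the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    value: float    # recorded value: a time on study, or an assay level
    censored: bool  # True if the true value lies beyond `value`


def survival_observation(entry_time: float, event_time: Optional[float],
                         study_end: float) -> Observation:
    """Right censoring: patients without an event by `study_end` are censored."""
    if entry_time > study_end:
        raise ValueError("patient entered after the study ended")
    if event_time is not None and event_time <= study_end:
        return Observation(event_time - entry_time, censored=False)
    # No event seen: all we know is that survival exceeds the time on study.
    return Observation(study_end - entry_time, censored=True)


def assay_observation(measured: Optional[float], detection_limit: float) -> Observation:
    """Left censoring: undetectable levels are recorded at the detection limit."""
    if measured is None or measured < detection_limit:
        return Observation(detection_limit, censored=True)
    return Observation(measured, censored=False)


# Two hypothetical patients in a trial that stops 36 months after recruitment began.
print(survival_observation(entry_time=0, event_time=14, study_end=36))
print(survival_observation(entry_time=10, event_time=None, study_end=36))
```

The second patient entered later, so the censored time is shorter: this is the sense in which censoring occurs after different lengths of observation.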

Special techniques for analysing censored survival data will be described in Chapter 13.

2.6 VARIABILITY

Statistics is largely about variability; in medical research this is often the variability between people. Sometimes it is the variability itself that is of prime interest, such as when describing the likely values of some measurement in a group of healthy subjects. Often, however, we are more interested in detecting underlying trends which may be obscured by variability. For example, when comparing two treatments on different groups of patients there may be considerable variation in the way patients respond to a particular treatment. The concept of variability is fundamental in statistics, and will recur throughout this book.

We use the term variable to denote anything that varies within a set of data. Although many variables relate to human subjects (or perhaps animals), the same considerations apply if one is studying variation from country to country (for example in perinatal mortality rates), comparing characteristics of small groups of individuals, or looking at variability in measurements of the same subject under different conditions.

All the examples of data given earlier in this chapter are called variables.

2.7 IMPORTANCE OF THE TYPE OF DATA

The many types of data just introduced can all be analysed by statistical methods, but the type of data can be critically important in determining which methods of analysis will be appropriate (and valid). In many medical studies variables of several types are collected, so that several different analytic methods may be needed. In Chapter 6 I shall give advice on how to record data for subsequent analysis.

Most statistical methods are specific to a certain type of data, with alternative techniques needed for different data types. The major distinction, however, is that between continuous and categorical variables. Further, for continuous or ordered categorical variables there is also the possibility of using alternative rank methods which are of much wider applicability.

These aspects of analysis will feature throughout this book. It is essential to use a method of analysis that is appropriate for the type of data.

2.8 DEALING WITH NUMBERS

2.8.1 Statistical analysis

When analysing data the rule is to use the full precision of the recorded data. There should not be any 'rounding' of intermediate results (see below). If you carry out your analysis on a computer the procedure just described will happen automatically. On a calculator it will happen only if intermediate calculations are stored in memory.

2.8.2 结果呈现 2.8.2 Presenting results

关于结果展示的建议出现在许多后续章节,但对数字展示的一些一般性介绍性评论可能会有所帮助。
Advice on presentation of results appears in many later chapters, but some general introductory comments on presenting numbers may be helpful.

类别数据的分析通常会得到出现次数的计数,例如不同血型受试者的数量及相应的百分比。如果像通常所希望的那样给出计数,则百分比不需要非常精确。例如,将45人中17人表示为37.78%并非必要,若同时给出原始数字,37.8%到38%之间即可。位数过多的数字更难理解。百分比在样本量非常小时可能会产生误导—当你指的是4人中1人时,不建议说“25%的患者对治疗反应良好”。
Analysis of categorical data often leads to counts of occurrences, such as the numbers of subjects in different blood groups, together with the corresponding percentages. If, as is usually desirable, the counts are given, the percentages do not need to be given very precisely. Thus, for example, it is not necessary to express 17 out of 45 as 37.78% or even 37.8%; 38% is sufficient if the raw numbers are given too. Numbers with many digits are much harder to assimilate. Percentages may mislead in very small samples: saying that 25% of patients responded well to the treatment is not recommended when you mean one out of four patients.

连续数据的分析会产生许多小数位的结果,如平均舒张压为85.348074 mmHg。此类结果显然应通过四舍五入进行简化(见下一节),同时考虑原始数据的准确度。在此例中,将平均血压报告为85.3 mmHg不会丢失重要信息。
The analysis of continuous data will lead to results that have many decimal places, such as an average diastolic blood pressure of 85.348074 mmHg. Results like this clearly should be shortened by rounding (see next section), bearing in mind the accuracy of the original data. In this example no important information would be lost if the average blood pressure was reported as 85.3 mmHg.

对于小于1.0的数字,建议在小数点前保留一个零—例如0.729而非.729。
For numbers less than 1.0 a zero before the decimal point is preferable: thus 0.729 rather than .729.

通常最好将所有可比结果保留相同的小数位数。
It is usually best to quote all comparable results to the same number of decimal places.

2.8.3 四舍五入数字 2.8.3 Rounding numbers

如果我们希望将数字如85.348074报告为更少的小数位数,采用一个简单的四舍五入规则。规则是:如果第一个被舍弃的数字小于5,则直接舍弃后续数字;否则将最后保留的数字加一。因此,将85.348074四舍五入到三位小数为85.348,四舍五入到两位小数为85.35。如果被舍弃的是单独的5或5后跟零,有些人建议四舍五入到最近的偶数,有些人则总是向上舍入。例如,将17.75四舍五入到一位小数为17.8,但16.85可能是16.8或16.9,取决于你的偏好。数字末尾的零应保留。因此,将28.402或28.399四舍五入到两位小数均为28.40。
If we wish to report a number such as 85.348074 to fewer decimal places, we use a simple rule for rounding. The rule is that excess digits are simply discarded if the first of them is less than five. Otherwise the last retained digit is increased by one. So rounding 85.348074 to three decimal places gives 85.348, while rounding to two decimal places gives 85.35. If the discarded information is a solitary 5 or a 5 followed by zeros some people recommend rounding to the nearest even digit, while others always round upwards. Thus rounding 17.75 to one decimal place gives 17.8, but 16.85 will give 16.8 or 16.9 depending upon your preference. Zeros on the end of a number should be retained. Thus if we round 28.402 or 28.399 to two decimal places we get 28.40.
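The rounding rule can be tried out with a short sketch. It uses Python's standard `decimal` module so that a value such as 16.85 is held exactly as written; the helper name `round_decimal` is hypothetical and not part of the text. `ROUND_HALF_UP` corresponds to always rounding a solitary 5 upwards, and `ROUND_HALF_EVEN` to rounding it to the nearest even digit.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_decimal(text: str, places: int, rule: str = ROUND_HALF_UP) -> str:
    """Round a number supplied as a string to `places` decimal places."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal('0.01')
    return str(Decimal(text).quantize(quantum, rounding=rule))

print(round_decimal("85.348074", 3))               # 85.348
print(round_decimal("85.348074", 2))               # 85.35
print(round_decimal("17.75", 1))                   # 17.8 under either rule
print(round_decimal("16.85", 1, ROUND_HALF_UP))    # 16.9
print(round_decimal("16.85", 1, ROUND_HALF_EVEN))  # 16.8
print(round_decimal("28.402", 2))                  # 28.40, trailing zero retained
print(round_decimal("28.399", 2))                  # 28.40
```

Passing the numbers as strings is a deliberate choice in this sketch: a binary floating-point 16.85 is not stored exactly, so the "solitary 5" case would not be reproduced reliably.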

注意避免对同一数字重复四舍五入,这可能导致错误。将85.348074四舍五入到两位小数得85.35,再将其四舍五入到一位小数则为85.4,而正确值应为85.3。
Beware of rounding the same number twice, which can lead to errors. If 85.348074 is rounded to two decimal places we get 85.35. If we then decide to round this value to one decimal place we get 85.4 rather than the correct value of 85.3.
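The double-rounding trap is easy to reproduce. The following is a minimal sketch using the standard `decimal` module with the "round half up" rule; the variable names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

value = Decimal("85.348074")

# Rounding in two stages: first to two places, then to one.
two_places = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
rounded_twice = two_places.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

# Rounding once, directly from the full-precision value.
rounded_once = value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(two_places)     # 85.35
print(rounded_twice)  # 85.4 (wrong)
print(rounded_once)   # 85.3 (correct)
```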

四舍五入应仅用于最终展示—分析过程中应保留全部精度。
Rounding should not be used until the final presentation - full precision should be retained during the analysis.

3 描述数据 3 Describing Data

3.1 引言 3.1 INTRODUCTION

如果说统计学的核心概念只有一个,那就是变异性。在医学中,这一点最明显地体现在人们在生理、生化及其他特征上的差异,以及他们对疾病和治疗的不同反应。我们还经常遇到本应相同的仪器之间的变异,以及不同观察者之间的差异。有时多种变异源同时存在。例如,当我的全科医生测量我的血压时,记录的数值很大程度上取决于某个未知的“真实”值,但也与测量时间、我是否迟到并匆忙赶到诊所、所用血压计的类型、我是否对结果感到焦虑等因素有关。当许多人测量血压时,年龄、性别和种族等因素会影响个体间的变异。
If there is one key concept underlying the subject of statistics, it is that of variability. In medicine we can see this most obviously in the way people differ in their physiological, biochemical and other characteristics and also in their variable responses to disease and to therapy. We also often encounter variability between machines that are supposed to be identical, and between different observers. There are sometimes many sources of variability present at once. For example, if I have my blood pressure measured, the value recorded by my GP will depend greatly on some unknown underlying 'true' value, but it will also relate to the time of day, whether I was late and had to run to the surgery, the type of sphygmomanometer being used, whether I was anxious about the outcome, and so on. When many people have their blood pressure measured, other factors will affect between-subject variability, such as age, sex and race.

一般来说,我们可以将变异分为已知原因引起的和未解释的两类。例如,在一项针对25至65岁男性的研究中,部分血压变异可归因于年龄,但大部分变异则无法解释。我们通常将这种未解释的变异称为随机变异。
In general we can divide variability into that due to known causes and that which is unexplained. Thus, for example, in a study of men aged 25 to 65 part of the variability in their blood pressures may be ascribed to their age, but most of the rest is unexplained. We often refer to this unexplained variability as random variation.

在任何研究中,我们通常希望以简单的方式总结部分数据。有时这就是统计分析的全部内容,但通常它是第一步。对于类别变量,如性别和血型,直接呈现各类别的数量是很简单的,通常还会显示该类别占总患者数的频率或百分比。图形展示时称为条形图。图3.1显示了1974年按职业划分的一般航空事故率条形图(Booze,1977)。类似的图表也可用来显示频率(或率)与另一变量值的关系。例如,图3.2展示了1979年英格兰和威尔士按星期几划分的每千次出生围产期死亡率,明显看到周末死亡率较高。条形图的纵轴必须从零开始,否则视觉效果会误导,夸大组间差异。
In any study we will usually want to summarize some of the data in a simple way. Sometimes this will be as far as the statistical analysis goes, but often it is a first step. For categorical variables, such as sex and blood group, it is straightforward to present the number in each category, usually indicating the frequency or percentage of the total number of patients. When shown graphically this is called a bar diagram. Figure 3.1 shows a bar diagram of general aviation accident rates in 1974 by occupation (Booze, 1977). A similar diagram can also be used to relate frequencies (or rates) to values of another variable. For example, Figure 3.2 shows perinatal mortality per 1000 births in England and Wales in 1979 by day of the week. The higher mortality rates at the weekend are clearly seen. It is very important that the vertical axis of a bar diagram starts at zero, otherwise the visual impression is misleading, with the differences between groups being exaggerated.


图3.1 1974年按职业划分的一般航空事故率条形图(每千次)(Booze,1977)。
Figure 3.1 Bar diagram showing general aviation accident rates (per 1000) in 1974 by occupation (Booze, 1977).


图3.2 1979年英格兰和威尔士按星期几划分的围产期死亡率(Macfarlane和Mugford,1984)。
Figure 3.2 Perinatal mortality in England and Wales in 1979 by day of the week (Macfarlane and Mugford, 1984).

对于连续变量,如年龄和血清胆红素,观察值种类繁多,因此需要另一种方法。本章余下部分将重点介绍用数值和图形方式描述和总结这类数据的方法。
For continuous variables, such as age and serum bilirubin, there will be a large number of different observed values, so an alternative approach is needed. The remainder of this chapter concentrates on ways of describing and summarizing such data both numerically and graphically.

本章我将首次引入一些数学符号。关于这些符号的进一步解释可见本书末尾的附录A。
In this chapter I shall introduce some mathematical notation for the first time. Further explanation of this notation can be found in Appendix A at the end of the book.

3.2 平均值 3.2 AVERAGES

描述一组连续变量的观测值时,显而易见的第一步是计算平均值。在日常用语中,“平均”一词并无精确定义,但在统计学中,有几种所谓的“集中趋势测度”被精确定义,可以作为平均或典型值。
The obvious first step when describing a set of observations of a continuous variable is to calculate the average value. In colloquial use the word 'average' does not have a precise meaning, but in statistics there are several so-called 'measures of central tendency' that are precisely defined and which can be taken as the average or typical value.

其中最常见的是算术平均数,通常简称为均值,即所有观测值之和除以观测值的数量。表3.1显示了25名囊性纤维化患者的年龄和肺功能数据。所示变量是最大静态吸气压(PImax)。
The most common of these is the arithmetic mean, usually just called the mean, which is the sum of all the observations divided by the number of observations. Table 3.1 shows age and lung function data for 25 patients with cystic fibrosis. The variable shown is the maximal static inspiratory pressure (PImax), which is an index of respiratory muscle strength.

表3.1 25名囊性纤维化患者的年龄和PImax(O'Neill等,1983年)
Table 3.1 Age and PImax in 25 patients with cystic fibrosis (O'Neill et al., 1983)

| 受试者 Subject | 年龄(岁) Age (years) | PImax (cm H2O) |
|---|---|---|
| 1 | 7 | 80 |
| 2 | 7 | 85 |
| 3 | 8 | 110 |
| 4 | 8 | 95 |
| 5 | 8 | 95 |
| 6 | 9 | 100 |
| 7 | 11 | 45 |
| 8 | 12 | 95 |
| 9 | 12 | 130 |
| 10 | 13 | 75 |
| 11 | 13 | 80 |
| 12 | 14 | 70 |
| 13 | 14 | 80 |
| 14 | 15 | 100 |
| 15 | 16 | 120 |
| 16 | 17 | 110 |
| 17 | 17 | 125 |
| 18 | 17 | 75 |
| 19 | 17 | 100 |
| 20 | 19 | 40 |
| 21 | 19 | 75 |
| 22 | 20 | 110 |
| 23 | 23 | 150 |
| 24 | 23 | 75 |
| 25 | 23 | 95 |

最大静态吸气压(PImax)是呼吸肌力量的指标。PImax值之和为2315,因此均值为 2315/25 = 92.6 cm H2O。通常提到“平均值”时指的就是均值。均值有时用 $\bar{x}$(读作“x bar”)表示,但除非在公式中,否则最好避免使用这种简写。
The sum of the PImax values is 2315, so the mean is 2315/25 = 92.6 cm H2O. The mean is the value usually meant when talking about 'the average'. The mean is sometimes indicated by $\bar{x}$ (pronounced 'x bar'), but this shorthand notation is best avoided other than in equations.

另一种常用的测度是中位数。中位数是将数据排序后处于中间位置的值。对于表3.1中的PImax数据,共有25个观测值,因此中位数是第13个值。将PImax值按升序排列,得到:
The other frequently used measure is the median. This is the value that comes half-way when the data are ranked in order. For the PImax data in Table 3.1 there are 25 observations, so the median is the 13th value in order. If we rank the PImax values in ascending order we get

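| 排名 Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PImax | 40 | 45 | 70 | 75 | 75 | 75 | 75 | 80 | 80 | 80 | 85 | 95 | 95 |

| 排名 Rank | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PImax | 95 | 95 | 100 | 100 | 100 | 110 | 110 | 110 | 120 | 125 | 130 | 150 |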

我们可以看到中位数是 95 cm H2O。更简单的是,我们可以直接从表3.1中看到这些患者的中位年龄是14岁。当观察值为偶数时,中位数定义为两个中间值的平均数:如果有24个观察值,中位数就是排序后第12个和第13个值的平均数。通常,中位数的上下两边观察值数量相等。然而,当有多个观察值等于中位数时,如PImax数据,这种情况可能不完全成立。
and we can see that the median is 95 cm H2O. More easily, we can see immediately from Table 3.1 that the median age of these patients was 14 years. When there is an even number of observations the median is defined as the average of the two central values: if we had 24 observations the median would be the average of the 12th and 13th values in an ordered listing of the observations. There are usually equal numbers of observations above and below the median. However, when there is more than one observation equal to the median, as for the PImax data, this may not be exactly true.
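As a check on the arithmetic, the mean and median of the Table 3.1 data can be computed with Python's standard `statistics` module. This is only an illustrative sketch; the list names `ages` and `pimax` are chosen here and are not part of the text.

```python
from statistics import mean, median

# Table 3.1, in subject order.
ages = [7, 7, 8, 8, 8, 9, 11, 12, 12, 13, 13, 14, 14, 15, 16,
        17, 17, 17, 17, 19, 19, 20, 23, 23, 23]
pimax = [80, 85, 110, 95, 95, 100, 45, 95, 130, 75, 80, 70, 80, 100, 120,
         110, 125, 75, 100, 40, 75, 110, 150, 75, 95]

print(sum(pimax))     # 2315
print(mean(pimax))    # sum divided by the 25 observations
print(median(pimax))  # 13th of the 25 ranked values: 95
print(median(ages))   # 14

# With an even number of observations the two central values are averaged.
print(median([70, 80, 95, 100]))  # 87.5
```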

当一些极端数据值被截断时,中位数特别有用。如果观察值在超过某一水平或低于检测限时未被精确记录,我们无法计算均值,但只要有超过一半受试者的确切值,就可以计算中位数。中位数在生存时间分析中也非常有价值,这将在第13章讨论。
The median is especially useful when some extreme data values are censored. If observations are not recorded precisely when they are above a certain level or below a level of detection, we cannot calculate the mean, but we can calculate the median if we have definite values for over half the subjects. The median is also valuable in the analysis of survival times, which is considered in Chapter 13.

均值和中位数都是描述一组数据平均或典型值的常用统计量。均值使用更为广泛,因为它与最常见的统计分析方法相契合,但中位数作为描述统计量并不逊色,在某些情况下比均值更有用,后文将详细说明。在某些情况下,我们还计算一种称为几何均值的指标,它通常接近中位数。几何均值的使用在3.4.4节中描述。
The mean and the median are both widely used to describe the average or typical value of a set of data. The mean is much more frequently used because this ties in well with the most common types of statistical analysis, but the median is in no way inferior as a descriptive statistic and in some circumstances it is much more useful than the mean, as we shall see later. In some situations we calculate another measure known as the geometric mean, which is usually close to the median. Its use is described in section 3.4.4.

描述数据中心的最后一个指标是众数,即最常见的观测值。众数在连续数据中很少有实际用途。
A final indicator of the centre of a set of data is the mode which is simply the most common value observed. The mode is rarely of any practical use for continuous data.

3.3 描述变异性 3.3 DESCRIBING VARIABILITY

描述一组连续变量观测值的第二个方面,是以某种方式评估观察值的变异性。任何一组数据都会包含许多不同的数值,例如上文所示的PImax数据。我们关心的是这些数值的分布情况:它们是都相似,还是变化很大?解决这个问题有多种方法。首先我将介绍图形方法,然后再考虑数值方法。
The second aspect of describing a set of observations of a continuous variable is to assess the variability of the observations in some way. Any set of data will contain many different values, for example the PImax data shown above. We are interested in the way these values are distributed: are they all similar or do they vary a lot? There are several ways of tackling this problem. I shall look first at graphical methods, and then consider numerical methods.

3.3.1 直方图 3.3.1 Histogram

一种简单的图形方式来描述一组完整的观测数据是使用直方图,在直方图中,不同数值或数值组的观测次数(或频率)被绘制出来。表3.2显示了298名6个月至6岁健康儿童免疫球蛋白IgM的频数分布,图3.3则展示了该数据的直方图。
A simple graphical way of depicting a complete set of observations is by means of the histogram, in which the number (or frequency) of observations is plotted for different values or groups of values. Table 3.2 shows the frequency distribution of the immunoglobulin IgM in 298 healthy children aged 6 months to 6 years, and Figure 3.3 shows a histogram of these values.

表3.2 298名6个月至6岁儿童血清IgM浓度(Isaacs等,1983年)
Table 3.2 Concentrations of serum IgM in 298 children aged 6 months to 6 years (Isaacs et al., 1983)

| IgM(克/升) IgM (g/l) | 儿童人数 Number of children |
|---|---|
| 0.1 | 3 |
| 0.2 | 7 |
| 0.3 | 19 |
| 0.4 | 27 |
| 0.5 | 32 |
| 0.6 | 35 |
| 0.7 | 38 |
| 0.8 | 38 |
| 0.9 | 22 |
| 1.0 | 16 |
| 1.1 | 16 |
| 1.2 | 6 |
| 1.3 | 7 |
| 1.4 | 9 |
| 1.5 | 6 |
| 1.6 | 2 |
| 1.7 | 3 |
| 1.8 | 3 |
| 2.0 | 3 |
| 2.1 | 2 |
| 2.2 | 1 |
| 2.5 | 1 |
| 2.7 | 1 |
| 4.5 | 1 |

图3.3 298名6个月至6岁儿童血清IgM浓度频率直方图(Isaacs等,1983)。
Figure 3.3 Frequency histogram of IgM concentrations in 298 children aged 6 months to 6 years (Isaacs et al., 1983).

如果数值很多,通常需要先将观察值分组,再绘制直方图,以获得更好的视觉效果。除非样本非常大,一般8到15组就足以满足良好的展示效果。具体组数取决于数据本身,且分组应尽量简单。虽然我们可以将IgM数据按0.25的区间分组,但这超出了数据的精确度。更好的做法是按0.2的区间分组,如图3.4所示。注意,每个竖条的宽度覆盖了被分组的数值范围。例如,当我们将0.1和0.2分为一组时,实际上包括了0.05至0.25之间的数值,尽管数据记录并不那么精确。直方图类似于条形图,但由于频率对应的是连续变量,直方图中相邻的条形应当相连。
If there are many different values it is often desirable to group observations before constructing a histogram in order to get a better visual impression. Unless the sample is very large, somewhere around 8 to 15 groups will usually suffice for a satisfactory display. This will depend upon the actual data, for it is desirable to keep the groupings simple. Although we could group the IgM data in intervals of, say, 0.25, this goes beyond the precision of the data. Better is the grouping in intervals of 0.2 shown in Figure 3.4. Note that the width of each vertical bar covers the range of values that have been grouped. So, for example, when we group 0.1 and 0.2 we are actually including values between 0.05 and 0.25 even though the data were not recorded that accurately. A histogram is similar to a bar diagram, but because the frequencies relate to a continuous variable adjacent bars of a histogram should touch.
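The grouping into intervals of 0.2 can be sketched as follows. This is an illustration rather than the method used to draw Figure 3.4: the helper `group_counts` is hypothetical, and IgM values are stored as whole tenths of a g/l (an assumption made here so that grouping uses integer arithmetic).

```python
# Table 3.2 frequencies; keys are IgM in tenths of a g/l (1 means 0.1 g/l).
igm_counts = {1: 3, 2: 7, 3: 19, 4: 27, 5: 32, 6: 35, 7: 38, 8: 38, 9: 22,
              10: 16, 11: 16, 12: 6, 13: 7, 14: 9, 15: 6, 16: 2, 17: 3,
              18: 3, 20: 3, 21: 2, 22: 1, 25: 1, 27: 1, 45: 1}

def group_counts(counts, width):
    """Pool recorded values into groups of `width` consecutive tenths."""
    grouped = {}
    for tenths, freq in counts.items():
        first = ((tenths - 1) // width) * width + 1  # lowest value in the group
        grouped[first] = grouped.get(first, 0) + freq
    return grouped

width = 2  # two recorded values per bar, i.e. intervals of 0.2 g/l
grouped = group_counts(igm_counts, width)
for first in sorted(grouped):
    # Each bar extends half a recording unit beyond the values it pools,
    # so the group holding 0.1 and 0.2 runs from 0.05 to 0.25.
    lower = (first - 0.5) / 10
    upper = (first + width - 0.5) / 10
    print(f"{lower:.2f} to {upper:.2f} g/l: {grouped[first]} children")
```

Only intervals that contain at least one child are printed; a plotted histogram would also leave space for the empty intervals between 2.85 and 4.45 g/l.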

直方图中的条形通常宽度相同,因为分组大小一致。如果分组大小不一,则应考虑条形的面积与频率成正比,而非高度。这个原则在1985年伦敦哈罗区交通事故受害者年龄分布数据中得到了体现。表3.3显示了这些数据。大多数受害者为成年人,且25至59岁年龄段人数最多。显然,分组宽度差异较大,范围从1岁到35岁不等,绘制直方图时必须考虑这一点。注意,为了在直方图中包含60岁以上组,我们需假设一个合理的最大年龄,这里取80岁。
The bars in histograms are usually all the same width, because the groupings are the same size. If the groups are not the same size this should be allowed for by remembering that it is the area of each bar that is proportional to the frequency, not its height. This principle is illustrated on data showing the age distribution of road accident casualties in the London borough of Harrow in 1985. Table 3.3 shows the data as presented. Most of the casualties were adults, with the greatest number in the age range 25 to 59. Clearly the widths of the groupings vary considerably, from 1 to 35 years in fact, and this must be taken account of in a histogram of the data. Note that in order to include the 60+ age group in a histogram we have to assume a reasonable upper age limit; here it will be taken as 80.


图3.4 与图3.3类似,但数据按0.2克/升区间分组。
Figure 3.4 As Figure 3.3 but data grouped in intervals of 0.2 g/l.

表3.3 1985年伦敦哈罗区交通事故受害者年龄分布(不包括65名年龄不详者)
Table 3.3 Road accident casualties in the London Borough of Harrow in 1985 (excluding 65 with unknown age)

| 年龄 Age | 频数 Frequency |
|---|---|
| 0-4 | 28 |
| 5-9 | 46 |
| 10-15 | 58 |
| 16 | 20 |
| 17 | 31 |
| 18-19 | 64 |
| 20-24 | 149 |
| 25-59 | 316 |
| 60+ | 103 |
| 合计 Total | 815 |

首先,考虑如果忽略上述警告,绘制一个直方图,其中每个年龄组的高度表示表3.3中的频数,宽度表示年龄范围,如图3.5所示。该直方图暗示16岁和17岁受害者数量远少于成年人,而我们可能预期情况正好相反。通过让频数对应条形面积而非高度,我们得到了正确的图像,如图3.6所示。这里我们考虑的是每岁年龄的受害者人数;当没有明确数据时,我们采用该年龄组的平均值。图3.6展示了数据的真实印象,我们可以看到,16至24岁年龄段的交通事故伤亡者比其他任何年龄组都更为常见。
First, consider what happens if we ignore the above warning and draw a histogram where, for each age group, the height indicates the frequency shown in Table 3.3 and the width shows the age range; this is shown in Figure 3.5. This histogram suggests that accident victims are much less likely to be 16 and 17 year olds than adults, whereas we would probably expect the opposite to be true. We get the correct picture by making the frequencies correspond to the area of each bar rather than its height, as is shown in Figure 3.6. What we have done is consider the number of casualties per year of age; where we don't have this explicitly we take the average value in that age group. Figure 3.6 shows a true impression of the data, from which we can see that road accident casualties are more likely to be aged 16 to 24 than any other age group.

图3.5 表3.3交通事故数据的错误直方图。
Figure 3.5 Incorrect histogram of road accident data of Table 3.3.

图3.6 道路交通事故数据的正确直方图。
Figure 3.6 Correct histogram of road accident data.
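The bar heights for a histogram with unequal groups can be worked out as casualties per year of age. The sketch below makes two assumptions that should be read as such: each group is treated as covering whole years from its lower age up to, but not including, the next group's lower age (so 25-59 spans 35 years), and the open-ended 60+ group is closed at 80, as in the text.

```python
# (lower age, upper age exclusive, casualties) from Table 3.3.
# The upper limit of 80 for the 60+ group is an assumption.
age_groups = [(0, 5, 28), (5, 10, 46), (10, 16, 58), (16, 17, 20),
              (17, 18, 31), (18, 20, 64), (20, 25, 149), (25, 60, 316),
              (60, 80, 103)]

def per_year(groups):
    """Bar height = frequency / width, so that bar area equals frequency."""
    return [(low, high, count / (high - low)) for low, high, count in groups]

heights = per_year(age_groups)
for low, high, height in heights:
    print(f"ages {low:>2} to {high - 1:>2}: {height:5.1f} casualties per year of age")

# The areas of the bars recover the original frequencies.
total_area = sum((high - low) * height for low, high, height in heights)
print(round(total_area))  # 815
```

On this scale the 25 to 59 group, which has the largest raw frequency, has a height of about 9 per year of age, well below the values of 20 to 32 for ages 16 to 24.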

注意,这个直方图仅显示了观察到的伤亡人数。它并不表示不同年龄段人群发生交通事故的风险—为此,我们还需要了解人口的年龄分布,并假设所有伤亡者均居住在哈罗区,且哈罗区居民没有在其他地方发生事故。
Note that this histogram just shows the observed numbers of casualties. It does not indicate the risk of a road accident for people of varying age. For this we would also need to know the age distribution of the population, and would need to assume that all casualties lived in Harrow and that no Harrow residents had accidents elsewhere.

有时显示样本中每个区间的比例更为有用。所有频数通过除以样本量并乘以 100 转换为百分比。图 3.7(a) 显示了 IgM 数据的相对频率直方图,其与图 3.3 的唯一区别是纵轴的标注方式。另一种绘图方法是将直方图所有柱顶的中点连接起来,这称为频率多边形。图 3.7(b) 显示了同一数据的此类图形。
It is sometimes more useful to show the proportion of the sample in each interval. All the frequencies are converted into percentages by dividing by the sample size and multiplying by 100. Figure 3.7(a) shows the resulting relative frequency histogram for the IgM data, which differs from Figure 3.3 only in the way the vertical axis is labelled. An alternative way of plotting the data is to join the mid-points of the tops of all the vertical bars of the histogram; this is called a frequency polygon. Figure 3.7(b) shows such a plot for the same data.


图 3.7 图 3.3 中的 IgM 数据分别以 (a) 相对频率直方图和 (b) 相对频率多边形展示。
Figure 3.7 IgM data in Figure 3.3 shown as (a) Relative frequency histogram, (b) Relative frequency polygon.

直方图的纵轴必须从零开始,且刻度不应有断裂。否则视觉印象会产生误导。同样,不应使用三维效果。
The vertical axis of a histogram must start at zero, and there should not be any breaks in the scale. Otherwise the visual impression will be misleading. Likewise three-dimensional effects should not be used.

3.3.2 茎叶图 3.3.2 Stem-and-leaf diagram

一种对直方图的巧妙改进称为茎叶图,它允许显示所有实际观测值。图 3.8 将表 3.1 中的 PImax 数据重新绘制为茎叶图。通过将左侧的数字(茎)与同一行右侧的数字(叶)连接,可以重构原始数据。这是一种非常经济的原始数据再现方法,比简单的数据列表更实用。
A clever modification of the histogram called a stem-and-leaf diagram allows all the actual observations to be shown too. Figure 3.8 shows the PImax data from Table 3.1 redrawn as a stem-and-leaf diagram. The raw data can be reconstructed by joining the numbers on the left (the stems) to each of the numbers on the right (the leaves) on the same row. This is a very economical way of reproducing the raw data, and is more useful than a simple list of the data.

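| 茎 Stem | 叶 Leaves |
|---|---|
| 4 | 05 |
| 5 | |
| 6 | |
| 7 | 05555 |
| 8 | 0005 |
| 9 | 5555 |
| 10 | 000 |
| 11 | 000 |
| 12 | 05 |
| 13 | 0 |
| 14 | |
| 15 | 0 |

图3.8 表3.1中PImax数据的茎叶图。
Figure 3.8 Stem-and-leaf diagram of the PImax data in Table 3.1.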
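A stem-and-leaf diagram is simple to build by hand or by program. The following sketch, with a hypothetical helper `stem_and_leaf`, splits each PImax value into a stem (the tens) and a leaf (the final digit) and keeps empty stems so that the shape of the distribution is preserved.

```python
pimax = [80, 85, 110, 95, 95, 100, 45, 95, 130, 75, 80, 70, 80, 100, 120,
         110, 125, 75, 100, 40, 75, 110, 150, 75, 95]

def stem_and_leaf(values, leaf_unit=10):
    """Return one text line per stem; leaves are the final digits, in order."""
    stems = {}
    for value in sorted(values):
        stems.setdefault(value // leaf_unit, []).append(value % leaf_unit)
    lines = []
    for stem in range(min(stems), max(stems) + 1):  # include empty stems
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

for line in stem_and_leaf(pimax):
    print(line)
```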

茎叶图在许多情况下表现良好,特别是当数据值多样时,但最佳格式取决于数据性质和样本大小。表 3.2 中的 IgM 数据无法用五个“茎”(0、1、2、3、4)成功制作茎叶图,但我们可以拆分每个组,得到有用的图形,如图 3.9 所示。
The stem-and-leaf diagram works well in many circumstances, especially where there are many different values, but the best format depends on the nature of the data and the sample size. The IgM data in Table 3.2 cannot be made into a successful stem-and-leaf diagram using five 'stems' (0, 1, 2, 3, 4), but we can split each group to get a useful diagram, as in Figure 3.9.

3.3.3 累积频数 3.3.3 Cumulative frequencies

我们之前已经看到,样本观测值的分布可以通过样本中各小区间内值所占的百分比来表示。这在图3.7的相对频率直方图中有所展示。我们可以进一步考虑,对于每个组,计算该组或更低组别中受试者的比例。因此,我们计算每个水平的累积频率,即小于或等于每个值的观测比例。计算结果见表3.4。累积相对频率可以绘制成直方图,如图3.10(a)所示。然而,对于累积频率,不必像这样分组数据,因为我们可以直接绘制累积频率,如图3.10(b)所示。该图既可用来查看任意选定水平上方或下方的观测百分比,也可用来找出某一百分比的儿童IgM值所处的具体数值。例如,我们可以很容易看出中位数IgM浓度为 0.7 g/l。如果数据已分组,直方图或累积直方图无法直接获得此信息。
We saw earlier how the distribution of a sample of observations can be shown as the percentage of the sample with values in each of several small ranges. This was shown in the relative frequency histogram in Figure 3.7. We can take this idea a stage further by considering for each group the proportion of subjects in that group or a lower one. Thus we calculate the cumulative frequency at each level, that is, the proportion of observations less than or equal to each value. The calculations are shown in Table 3.4. The cumulative relative frequencies can be plotted in a histogram, as in Figure 3.10(a). However, for cumulative frequencies there is no need to group the data like this because we can plot the cumulative frequencies directly, as in Figure 3.10(b). This plot can be used either to see what percentage of observations lie above or below any chosen level, or to find the values which a given percentage of children's IgM values lie above or below. For example, we can easily see that the median IgM concentration was 0.7 g/l. This information cannot be obtained from a histogram or cumulative histogram if values have been grouped.

表3.4 298个IgM值的累积频率分布
Table 3.4 Cumulative frequency distribution of 298 IgM values

| IgM (g/l) | 频数 Frequency | 相对频率 Relative frequency (%) | 累积频数 Cumulative frequency | 累积相对频率 Cumulative relative frequency (%) |
|---|---|---|---|---|
| 0.1 | 3 | 1.0 | 3 | 1.0 |
| 0.2 | 7 | 2.3 | 10 | 3.4 |
| 0.3 | 19 | 6.4 | 29 | 9.7 |
| 0.4 | 27 | 9.1 | 56 | 18.8 |
| 0.5 | 32 | 10.7 | 88 | 29.5 |
| 0.6 | 35 | 11.7 | 123 | 41.3 |
| 0.7 | 38 | 12.8 | 161 | 54.0 |
| 0.8 | 38 | 12.8 | 199 | 66.8 |
| 0.9 | 22 | 7.4 | 221 | 74.2 |
| 1.0 | 16 | 5.4 | 237 | 79.5 |
| 1.1 | 16 | 5.4 | 253 | 84.9 |
| 1.2 | 6 | 2.0 | 259 | 86.9 |
| 1.3 | 7 | 2.3 | 266 | 89.3 |
| 1.4 | 9 | 3.0 | 275 | 92.3 |
| 1.5 | 6 | 2.0 | 281 | 94.3 |
| 1.6 | 2 | 0.7 | 283 | 95.0 |
| 1.7 | 3 | 1.0 | 286 | 96.0 |
| 1.8 | 3 | 1.0 | 289 | 97.0 |
| 2.0 | 3 | 1.0 | 292 | 98.0 |
| 2.1 | 2 | 0.7 | 294 | 98.7 |
| 2.2 | 1 | 0.3 | 295 | 99.0 |
| 2.5 | 1 | 0.3 | 296 | 99.3 |
| 2.7 | 1 | 0.3 | 297 | 99.7 |
| 4.5 | 1 | 0.3 | 298 | 100.0 |
| 总计 Total | 298 | 99.9 | | |

图3.10 IgM数据展示:(a) 累积相对频率直方图,(b) 累积分布图。
Figure 3.10 IgM data shown as (a) Cumulative relative frequency histogram, (b) Cumulative distribution.
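The columns of Table 3.4 follow mechanically from the frequencies in Table 3.2. The sketch below is one way to reproduce them; the variable names are illustrative, IgM is again stored in tenths of a g/l, and reading the median as the first level at which the cumulative frequency reaches half the sample is a simplification that happens to suffice for these data.

```python
from itertools import accumulate

# Table 3.2 frequencies; keys are IgM in tenths of a g/l (1 means 0.1 g/l).
igm_counts = {1: 3, 2: 7, 3: 19, 4: 27, 5: 32, 6: 35, 7: 38, 8: 38, 9: 22,
              10: 16, 11: 16, 12: 6, 13: 7, 14: 9, 15: 6, 16: 2, 17: 3,
              18: 3, 20: 3, 21: 2, 22: 1, 25: 1, 27: 1, 45: 1}

total = sum(igm_counts.values())
levels = sorted(igm_counts)
cumulative = list(accumulate(igm_counts[level] for level in levels))

# One row per IgM level: value, frequency, relative %, cumulative, cumulative %.
rows = []
for level, cum in zip(levels, cumulative):
    freq = igm_counts[level]
    rows.append((level / 10, freq, round(100 * freq / total, 1),
                 cum, round(100 * cum / total, 1)))

for row in rows:
    print(row)

# First level at which at least half of the observations have been reached.
median_level = next(value for value, _, _, cum, _ in rows if cum >= total / 2)
print(median_level)  # 0.7
```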

累积频率对于比较两个或多个不同群体的数值分布尤为有用。图3.11(a)显示了1568名吸烟者子女和1576名非吸烟者子女的首次长牙年龄的相对频率直方图。图3.11(b)展示了相同数据的累积直方图。图3.11(c)展示了相同数据的累积频率多边形。由于我们考虑的是累积频率,连接的是垂直条的右端点,而非图3.7(b)中的中点。该图显示两组间的差异没有图3.11(b)中看起来那么大—前者两组并排显示,可能导致视觉误导。图3.11(c)清晰显示吸烟者子女首次长牙的中位年龄约提前一周。
Cumulative frequencies are especially useful for comparing the distribution of values in two or more different groups of individuals. Figure 3.11(a) shows relative frequency histograms for the age at first tooth eruption of 1568 children of smokers and 1576 children of non-smokers. Figure 3.11(b) shows cumulative histograms of the same data. Figure 3.11(c) shows cumulative frequency polygons of the same data. Because we are considering cumulative frequencies we join the right-hand points of the vertical bars rather than the mid-points as in Figure 3.7(b). This plot shows that the difference between the groups is not as great as was suggested in Figure 3.11(b); the two groups were side by side in the previous plot, which can lead to a misleading visual impression. We can easily see from Figure 3.11(c) that the median age at first tooth eruption was about one week earlier in the children of smokers.

3.4 变异性的量化 3.4 QUANTIFYING VARIABILITY

图形方法对于检查数据的变异性很重要,但同样需要一种数值方法来总结变异量。与均值结合使用时,这能提供对一组观测的简明而有信息的总结。量化数据变异性的主要方法有三种:我们可以报告所有值的极差,报告从累积频率分布中得出的特定值,或获得观测值围绕均值的离散程度的数值度量。
Graphical methods are important for examining the variability of data, but it is necessary also to have a numerical way of summarizing the amount of variability. Used in conjunction with the mean, this would provide an informative but brief summary of a set of observations. There are three main approaches to quantifying the variability of a set of data. We can either quote the range of all the values, specific values derived from the cumulative frequency distribution, or we can obtain a numerical measure of the dispersion of the observations around the mean.

3.4.1 极差 3.4.1 Range

描述一组数据分布最简单的方法是报告最低值和最高值。这些值称为极差。IgM数据的极差是0.1到4.5 g/l。这不是一个令人满意的总结,因为它仅考虑了数据两端最极端(且可能最异常)的值,中间值的分布情况不会影响极差。因此,对于IgM数据,我们不知道4.5远高于第二高值2.7 g/l。主要基于此原因,极差并不常用。
The simplest way to describe the spread of a set of data is to quote the lowest and highest values. These values are known as the range. The range of the IgM data was 0.1 to 4.5 g/l. This is not a satisfactory summary, because it takes account of only the most extreme (and perhaps most peculiar) values at each end of the data, and the way the intermediate values are distributed will not influence the range. Thus for the IgM data we have no idea that 4.5 was considerably more than the second highest value of 2.7 g/l. Mainly for this reason the range is not widely used.

3.4.2 百分位数 3.4.2 Centiles

通过指定两个涵盖大部分而非全部数据值的数值,我们可以绕过大部分困难。例如,我们可以计算90%的观测值所处的区间。低于某一给定百分比的值称为百分位数(centile或percentile),对应于具有指定累积相对频率的数值。
By specifying two values that encompass most rather than all of the data values we get round much of the difficulty. For example, we could calculate the values between which 90% of the observations lie. The value below which a given percentage of the values occur is called a centile or percentile, and corresponds to a value with a specified cumulative relative frequency.

我们需要IgM值分布的第5和第95百分位数。从表3.4的最后一列可以看到,累积相对频率在IgM值为0.3 g/l的组别中超过了5%,而95%则在1.6 g/l处达到。
We require the 5th and 95th centiles of the distribution of IgM values. From the last column of Table 3.4 we can see that the cumulative relative frequency passes 5% somewhere in the group of IgM values of 0.3 g/l, and 95% is reached at the value of 1.6 g/l.

更正确的通用方法是计算所需观测值的秩次,通过取样本量乘以相应百分比再加一来实现。这里我们需要秩次为0.05 × 299 = 14.95和0.95 × 299 = 284.05的值。此计算通常得到非整数值,因此可能需要插值。例如,我们想要第5百分位数,即第14和第15个秩次之间0.95倍位置的IgM值。根据表3.4,这两个秩次的IgM值均为0.3 g/l,因此第5百分位数为0.3 g/l;同理,第95百分位数为1.7 g/l。然而,如果我们想要第10百分位数,则需要秩次为0.10 × 299 = 29.9的IgM值。秩次29和30的观测值分别为0.3和0.4 g/l,我们通过计算0.3 + 0.9 × (0.4 - 0.3) = 0.39 g/l来进行插值。由此,0.3和1.7分别是该儿童样本中IgM观测分布的第5和第95百分位数,这两个值定义了一个90%的中心范围—即中央90%的值所在区间(排除分布两端各5%的值)。
A more correct general approach is to calculate the ranks of the required observations, which we do by taking the necessary percentages of the sample size plus one. Here we need the values with ranks 0.05 × 299 = 14.95 and 0.95 × 299 = 284.05. This calculation usually leads to non-integer values, so we may need to interpolate. For example, for the 5th centile we want the value of IgM 0.95 of the way between the 14th and 15th observations in rank order. As these are, from Table 3.4, both equal to 0.3 g/l, the 5th centile is 0.3 g/l, and likewise the 95th centile is 1.7 g/l. However, if we want the 10th centile, we would need the IgM value corresponding to a rank of 0.10 × 299 = 29.9. The observations with ranks 29 and 30 are 0.3 and 0.4 g/l, and we take the value nine-tenths of the way between these values, by calculating 0.3 + 0.9 × (0.4 - 0.3) = 0.39 g/l. The values 0.3 and 1.7 are thus the 5th and 95th centiles of the observed distribution of IgM in this sample of children, and these two values specify what we can call a 90% central range, that is, the range within which the central 90% of values lie (i.e. excluding 5% at each end of the distribution).
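The rank-and-interpolate procedure translates directly into code. The function below is a sketch of that procedure only; the name `centile`, and the choice to return the smallest or largest observation when the required rank falls outside the data, are assumptions made here.

```python
# Table 3.2 frequencies; keys are IgM in tenths of a g/l (1 means 0.1 g/l).
igm_counts = {1: 3, 2: 7, 3: 19, 4: 27, 5: 32, 6: 35, 7: 38, 8: 38, 9: 22,
              10: 16, 11: 16, 12: 6, 13: 7, 14: 9, 15: 6, 16: 2, 17: 3,
              18: 3, 20: 3, 21: 2, 22: 1, 25: 1, 27: 1, 45: 1}

# Expand the frequency table into the 298 individual values, in rank order.
igm = [level / 10 for level in sorted(igm_counts)
       for _ in range(igm_counts[level])]

def centile(sorted_values, percent):
    """Value at rank percent/100 * (n + 1), interpolating between ranks."""
    n = len(sorted_values)
    rank = percent / 100 * (n + 1)
    whole = int(rank)            # rank of the observation just below
    fraction = rank - whole      # how far to move towards the next one
    if whole < 1:
        return sorted_values[0]
    if whole >= n:
        return sorted_values[-1]
    below = sorted_values[whole - 1]   # ranks count from 1, indices from 0
    above = sorted_values[whole]
    return below + fraction * (above - below)

print(round(centile(igm, 5), 2))    # rank 14.95 -> 0.3
print(round(centile(igm, 95), 2))   # rank 284.05 -> 1.7
print(round(centile(igm, 10), 2))   # rank 29.9 -> 0.39
```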

除了第5和第95百分位数外,还可以引用其他百分位数。最常见的替代是引用95%的中心范围(第2.5和第97.5百分位数),但有时也使用80%的中心范围(第10和第90百分位数)。第50百分位数即中位数,因为一半的观测值小于(或大于)该值。第25和第75百分位数称为四分位数;这两个值与中位数共同将数据分成四个等人数子组。第25和第75百分位数之间的数值差称为四分位距,有时用于描述变异性。
Other centiles can be quoted rather than the 5th and 95th. The most common alternative is to quote a 95% central range (2.5th and 97.5th centiles), but an 80% central range (10th and 90th centiles) is sometimes used. The 50th centile is another name for the median, as half of the observations are less than (and greater than) this value. The 25th and 75th centiles are known as quartiles; these values together with the median divide the data into four equally populated subgroups. The numerical difference between the 25th and 75th centiles is the inter-quartile range, and is occasionally used to describe variability.

使用分位数总结数据的一种简单但有用的半图形方法是箱线图。图3.12展示了IgM数据的箱线图。箱体表示下四分位数和上四分位数,中间的线是中位数。须的末端点是2.5%和97.5%的值,尽管有时须表示极端值。对于单组数据,直方图更具信息量,但多组数据可以用箱线图经济地总结。有时超出须范围的值会单独绘出。
A simple but useful semi-graphical way of summarizing data using centiles is the box-and-whisker plot. Figure 3.12 shows a box-and-whisker plot for the IgM data. The box indicates the lower and upper quartiles and the central line is the median. The points at the ends of the 'whiskers' are the 2.5% and 97.5% values, although the whiskers sometimes indicate the extreme values. For a single set of data a histogram is more informative, but several sets of data can be summarized economically using the box-and-whisker plot. Sometimes any values outside the range of the whiskers are plotted individually.
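The five values drawn in a box-and-whisker plot of this kind can be obtained with `statistics.quantiles` from the Python standard library. Its default `"exclusive"` method places the cut points at ranks that are fractions of the sample size plus one, which, as far as I can tell, matches the rank rule used for centiles above; treat that equivalence as an assumption of this sketch.

```python
from statistics import quantiles

# Table 3.2 frequencies; keys are IgM in tenths of a g/l (1 means 0.1 g/l).
igm_counts = {1: 3, 2: 7, 3: 19, 4: 27, 5: 32, 6: 35, 7: 38, 8: 38, 9: 22,
              10: 16, 11: 16, 12: 6, 13: 7, 14: 9, 15: 6, 16: 2, 17: 3,
              18: 3, 20: 3, 21: 2, 22: 1, 25: 1, 27: 1, 45: 1}
igm = [level / 10 for level in sorted(igm_counts)
       for _ in range(igm_counts[level])]

# Quartiles: the 25th, 50th and 75th centiles.
lower_q, med, upper_q = quantiles(igm, n=4, method="exclusive")
iqr = upper_q - lower_q

# Cutting into 40 equal parts gives steps of 2.5%, so the first and last
# cut points are the 2.5th and 97.5th centiles (the whisker ends).
fortieths = quantiles(igm, n=40, method="exclusive")
whisker_low, whisker_high = fortieths[0], fortieths[-1]

print(round(whisker_low, 2), round(lower_q, 2), round(med, 2),
      round(upper_q, 2), round(whisker_high, 2))
print(round(iqr, 2))
```

For these data the sketch gives 0.2, 0.5, 0.7, 1.0 and 2.0 g/l, with an inter-quartile range of 0.5 g/l.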

3.4.3 标准差 3.4.3 Standard deviation

图3.12 IgM数据的箱线图,显示了2.5%、25%、50%、75%和97.5%的累积相对频率(百分位数)。
Figure 3.12 Box-and-whisker plot of the IgM data, showing the 2.5, 25, 50, 75 and 97.5% cumulative relative frequencies (centiles).

量化变异性的另一种方法基于计算每个值与均值距离的平均值。对于观察值为 $x_i$ 的个体,其与均值 $\bar{x}$ 的距离为 $x_i - \bar{x}$;若有 $n$ 个观察值,则有一组 $n$ 个这样的距离,每个个体一个。低于均值的观察值距离为负。我们可以计算观察值与均值之间距离的平均值,但这些距离的和 $\sum (x_i - \bar{x})$ 总是零,因为均值是由个体观察值计算得出。然而,如果先将距离平方再求和,得到的量必为正。平方差的平均值因此衡量了个体相对于均值的偏差。这个量称为方差,定义为
The alternative approach to quantifying variability is based on the idea of averaging the distance each value is from the mean. For an individual with an observed value $x_i$ the distance from the mean is $x_i - \bar{x}$, and if we have $n$ observations we have a set of $n$ such distances, one for each individual. For observations below the mean the difference will be negative. We can calculate the average distance between the observations and their mean, but the sum of these distances, $\sum (x_i - \bar{x})$, is always zero because of the way the mean is calculated from the individual observations. However, if we square the distances before we sum them we get a quantity that must be positive. The average of these squared differences thus gives a measure of individual deviations from the mean. This quantity is called the variance, and is defined as

$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$

注意,我们除以的是 $n-1$ 而不是更直观的 $n$。除以 $n$ 得到的是观测值围绕样本均值的方差,但我们几乎总是将数据视为来自某个更大总体的样本,且希望用样本数据来估计总体的变异性。除以 $n-1$ 能更好地估计总体方差,虽然对于大样本而言,两者差异可忽略不计。
Note that we divide by $n-1$ rather than the more obvious $n$. Dividing by $n$ gives the variance of the observations around the sample mean, but we virtually always consider our data as a sample from some larger population, and wish to use the sample data to estimate the variability in the population. Dividing by $n-1$ gives us a better estimate of the population variance, although clearly for large samples the difference is negligible.
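The effect of the divisor can be seen on the PImax data. In Python's standard `statistics` module, `variance` divides by the number of observations minus one and `pvariance` divides by the number of observations; the sketch below simply compares the two and is not part of the original text.

```python
from statistics import pvariance, variance

pimax = [80, 85, 110, 95, 95, 100, 45, 95, 130, 75, 80, 70, 80, 100, 120,
         110, 125, 75, 100, 40, 75, 110, 150, 75, 95]
n = len(pimax)

sample_variance = variance(pimax)    # sum of squared deviations / (n - 1)
divide_by_n = pvariance(pimax)       # sum of squared deviations / n

print(round(sample_variance, 2))     # 621.08
print(round(divide_by_n, 2))         # 596.24

# The two differ only by the factor n / (n - 1), which approaches 1 as n grows.
print(round(sample_variance / divide_by_n, 4), round(n / (n - 1), 4))
```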

方差将在后续章节中出现,尤其是在讨论称为方差分析的技术时。就目前目的而言,方差并不是描述变异性的合适指标,因为它的单位与原始数据不同。例如,我们不希望用平方毫米汞柱来表达一组血压测量值的变异性。解决这一问题的显而易见方法是取方差的平方根作为我们的度量。我们称此量为标准差。标准差通常缩写为 sd、SD、$s$ 或 $\sigma$(希腊字母西格玛),定义为
The variance will turn up in later chapters, notably when discussing the technique known as analysis of variance. For our present purpose, the variance is not a suitable measure for describing variability because it is not in the same units as the raw data. We do not, for example, wish to express the variability of a set of blood pressure measurements in square mm Hg. The obvious solution to this problem is to take as our measure the square root of the variance. We call this quantity the standard deviation. The standard deviation is usually abbreviated to sd or SD or $s$ or $\sigma$ (the Greek letter sigma), and is defined as

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.$$

标准差这个名称并不十分恰当,因为它并没有“标准”的含义。更合理的理解是它大致表示观测值偏离均值的平均距离(或偏差)。
Standard deviation is not a good name for this statistic as there is nothing 'standard' about it. It may more reasonably be thought of as approximately the average deviation (or distance) of the observations from the mean.

许多计算器可以通过标记为 $\sigma$ 或 $\sigma_{n-1}$ 的键计算标准差。(这里使用希腊字母 $\sigma$ 而非 $s$ 并非严格正确,下一章将对此作解释。如果有标记为 $\sigma_n$ 和 $\sigma_{n-1}$ 的键,应使用后者。)
Many calculators can calculate the standard deviation, by means of a key marked $\sigma$ or $\sigma_{n-1}$. (The use of the Greek $\sigma$ here rather than $s$ is not strictly correct, as will be explained in the next chapter. If there are keys marked $\sigma_n$ and $\sigma_{n-1}$ the latter should be used.)

However, should we wish to do the calculation ourselves there is a much easier formula to use, which is mathematically equivalent:

$$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}.$$

(Note the simplification of the $\sum$ notation, as described in Appendix A.) Using this formula we can calculate the standard deviation from the sum of the observations, $\sum x$, and the sum of the squares of the observations, $\sum x^2$. We do not need to calculate the individual distances from the mean.
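The equivalence of the two formulas can be checked numerically. The following is a minimal sketch in Python; the function names and the readings are hypothetical and are not data from the book.

```python
import math
import statistics

def sd_definition(xs):
    """Standard deviation from squared distances to the mean (divisor n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """The same quantity computed from the sum and the sum of squares only."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Hypothetical readings, for illustration only.
readings = [80, 85, 110, 95, 100, 45, 130, 75]

print(sd_definition(readings))
print(sd_shortcut(readings))
print(statistics.stdev(readings))  # library version, also uses n - 1
```

The shortcut is convenient by hand, but in floating-point arithmetic it subtracts two large, nearly equal numbers, so it can lose precision when the mean is large relative to the spread; library routines such as `statistics.stdev` avoid this.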

For example, for the PImax data shown in Table 3.1 the sum of the data and the sum of the squares of the data are

$$\sum x = 2315 \quad \text{and} \quad \sum x^2 = 229275,$$

so the mean PImax is $2315/25 = 92.60$ and the standard deviation is

$$s = \sqrt{\frac{229275 - 2315^2/25}{24}} = \sqrt{621.08} = 24.92.$$

Note that I shall keep an extra decimal place at present for the mean and standard deviation because I shall be doing some further calculations. One decimal place would be sufficient when reporting these results.

The standard deviation has an important role in data analysis, but here we are concerned with its value as a descriptive statistic. In fact, although the standard deviation is widely used for this purpose it is useful only indirectly for describing the variability of a set of data. We can say, for example, that in many circumstances the large majority (about 95%) of a set of observations will be within two standard deviations of the mean. The appropriateness of this statement depends on the shape of the distribution of the data. If the distribution is reasonably symmetric then the above statement will usually be true.

For the PImax data in Figure 3.8 the mean was 92.60 and the standard deviation was 24.92. The values that are two standard deviations either side of the mean are 42.76 and 142.44. (We often use the expression 'mean $\pm$ 2SD' to mean both of these values, i.e. the mean 'plus or minus' twice the standard deviation.) All but two of the 25 observations were within this range; we would expect to find on average one observation outside the range mean $\pm$ 2SD (i.e. about 5% of 25).
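Checking how many observations fall outside mean $\pm$ 2SD is easy to automate. The sketch below uses hypothetical values (not the PImax data), and the helper names are assumptions made for illustration.

```python
import statistics

def two_sd_limits(xs):
    """Return (mean - 2SD, mean + 2SD) using the n - 1 standard deviation."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return mean - 2 * sd, mean + 2 * sd

def count_outside(xs, lower, upper):
    """Number of observations strictly outside the interval [lower, upper]."""
    return sum(1 for x in xs if x < lower or x > upper)

# Hypothetical measurements with one unusually large value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
lower, upper = two_sd_limits(values)

print(lower, upper)
print(count_outside(values, lower, upper))
```

With so few observations the 'about 95%' statement is only a rough guide, which is why the text compares the count with its expected value rather than treating the range as exact.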

3.4.4 Skewed distributions

For data which do not have a symmetric distribution we need to be careful when using the standard deviation in the way just described. For example, the IgM data in Figure 3.3 clearly have an asymmetric distribution: there is a long right-hand 'tail'. This is called a skewed distribution. The mean and standard deviation of the IgM data are 0.80 and 0.47 respectively. Calculating the mean $\pm$ 2SD gives the values $-0.14$ and 1.74. The lower value is negative, which is not a possible value of IgM. The upper value of 1.74 is exceeded by 12 of the observations, 4% of the total. The two values clearly do not describe the range of the bulk of the data very well. Although they still include about 95% of the observations, the exclusions are all in one tail.

For measurements that cannot be negative, which is usually the case, we can infer that the data have a skewed distribution if the standard deviation is more than half the mean. There is no guarantee that the converse is true, however, but a histogram will quickly reveal whether the data are skewed or not. Skewness like that of the IgM data is called positive skewness and is common. The opposite phenomenon, with an extended left hand tail, is called negative skewness and is rare.

In general, when we have data with a skewed distribution we use other ways of describing the data. There are two main possibilities. The first is to transform the data mathematically so that the transformed data have a more nearly symmetric distribution. The most frequent device is to take logarithms (logs) of the data. The rationale for this approach will be discussed in Chapter 7. We can see that it works well here, however, from Figure 3.13, which shows a histogram of $\log_{10}$ IgM values together with the values mean $\pm$ 2SD calculated from the log data. These limits cut off 10 values in the lower tail of the distribution and 6 in the upper tail, and thus give a range of values encompassing 282/298, or about 95%, of the observations. The cut-off values can be 'back-transformed' to the original scale giving 0.23 and 2.08, and reference to Table 3.2 shows the 16 values outside these limits. If we back-transform (or 'antilog') the mean of the log data we get a quantity known as the geometric mean. Where log transformation successfully removes skewness the geometric mean will be similar to the median, and will be less than the mean of the raw data. The standard deviation of the log data cannot be meaningfully back-transformed.

Note that log data can be negative, and that it does not matter whether logs to base e or base 10 are used. In this example, logs to base 10 were used, with the function $10^x$ used for the back-transformation. Log transformation is only useful for removing positive skewness.
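The log-scale calculation can be sketched as follows. The data are hypothetical positive values (not the IgM measurements) and the function name is an assumption; the point is only to show the geometric mean and the back-transformed limits being computed, and that the choice of base does not matter.

```python
import math
import statistics

def log_scale_summary(xs):
    """Geometric mean and back-transformed mean +/- 2SD limits, using base-10 logs.

    All values must be strictly positive.
    """
    logs = [math.log10(x) for x in xs]
    mean_log = statistics.mean(logs)
    sd_log = statistics.stdev(logs)
    geometric_mean = 10 ** mean_log
    lower = 10 ** (mean_log - 2 * sd_log)
    upper = 10 ** (mean_log + 2 * sd_log)
    return geometric_mean, lower, upper

# Hypothetical positively skewed measurements.
values = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 2.0, 4.5]
gm, lower, upper = log_scale_summary(values)

# The same geometric mean obtained with natural logs instead of base 10.
gm_natural = math.exp(statistics.mean([math.log(x) for x in values]))

print(gm, lower, upper)
print(gm_natural, statistics.mean(values))
```

Unlike mean $\pm$ 2SD on the raw scale, the back-transformed lower limit can never be negative, because it is a power of 10.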

The alternative approach to describing the distribution of skewed data is to calculate the centiles corresponding to a chosen central range. For example, to get the values that enclose 95% of the observations we need to calculate the 2.5th and 97.5th centiles. Using the method described in the previous section, these values are obtained by interpolation; the lower one is 0.2. Values obtained in this way are called empirical centiles, as opposed to the earlier values of 0.23 and 2.08 (obtained from the mean $\pm$ 2SD of the log data), which are estimated centiles. The two methods agree well for these data. Likewise the median IgM value is very close to the geometric mean.

Figure 3.13 Frequency histogram of $\log_{10}$ IgM showing the values of mean $\pm$ 2SD.
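Empirical centiles can be sketched in a few lines. The interpolation rule of the previous section is not reproduced in this part of the text, so the code assumes one common rule: the $p$th centile sits at rank $p(n+1)/100$ in the ordered data, with linear interpolation between neighbouring observations. The data and function name are hypothetical.

```python
import statistics

def centile(xs, p):
    """p-th centile by linear interpolation at rank p * (n + 1) / 100.

    The rank rule is an assumption; ranks beyond the ends of the data are
    clamped to the smallest or largest observation.
    """
    data = sorted(xs)
    n = len(data)
    rank = p * (n + 1) / 100
    if rank <= 1:
        return data[0]
    if rank >= n:
        return data[-1]
    k = int(rank)        # integer part of the rank (1-based)
    frac = rank - k      # fractional part used for interpolation
    return data[k - 1] + frac * (data[k] - data[k - 1])

# Hypothetical observations.
values = [3, 5, 7, 8, 12, 13, 14, 18, 21]

print(centile(values, 25), centile(values, 50), centile(values, 75))
# The standard library's 'exclusive' method uses the same rank rule.
print(statistics.quantiles(values, n=4, method="exclusive"))
```

With only nine observations the 2.5th and 97.5th centiles simply collapse to the smallest and largest values, which illustrates why centiles are not practical for small samples.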

3.4.5 Comment

The standard deviation is one of the key quantities in statistical analysis. Its value for describing variability is conditional on the distribution of the data. Although it is always valid to calculate the standard deviation we can infer that about 95% of the observations were in the interval mean $\pm$ 2SD only if we know (or assume) that the distribution of the data was reasonably symmetric. In fact, as happens with the IgM data, the range mean $\pm$ 2SD may include about 95% of the observations even when the distribution is skewed. However, while we may reasonably use just the mean and SD to summarize such data, the skewness will be hidden. For skewed data, it is preferable to use the median and a central range (such as the central 95%) to summarize a set of observations. However, it is not practical to quote centiles for small samples, so the range can be given. Otherwise, the standard deviation can be used. It has the advantage of using each observation directly and it is easier to calculate (by computer) for large amounts of data.

The question of the shape of the distribution of one's data is of fundamental importance when choosing a method of analysis, as will be seen in later chapters.

3.5 TWO VARIABLES

3.5.1 Describing data in two or more groups

In many studies comparisons are made between different groups. For example, two groups of patients may be given different treatments and the outcomes observed. It is desirable in such studies to demonstrate that the characteristics of the groups of subjects were comparable at the start of the study. As an example, Table 3.5 shows the characteristics of the groups of subjects in a clinical trial comparing short-wave diathermy treatment, osteopathic treatment, and an ineffective placebo treatment in patients with non-specific low back pain (Gibson et al., 1985). The characteristics of the three groups at the start of the study (often called 'baseline' values) are shown as numbers and percentages for categorical variables, and as means and standard deviations for the two continuous variables. This information is usually sufficient to judge the comparability of the groups. I shall consider how we assess whether they are comparable in Chapter 15. For the moment we can see that duration of pain had a skewed distribution, as the mean is less than twice the standard deviation (that is, the SD is more than half the mean) in all three groups.
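The rule of thumb from section 3.4.4 (for measurements that cannot be negative, an SD of more than half the mean suggests skewness) can be applied directly to the published summary figures. This is a minimal sketch; the function name is hypothetical, and the (mean, SD) pairs are those printed in Table 3.5.

```python
def sd_exceeds_half_mean(mean, sd):
    """Rule of thumb for non-negative measurements: SD > mean / 2 suggests skewness."""
    return sd > mean / 2

# (mean, SD) pairs as printed in Table 3.5.
duration_of_pain_weeks = {
    "Short-wave diathermy": (18, 11),
    "Osteopathy": (16, 14),
    "Placebo": (17, 11),
}
age_years = {
    "Short-wave diathermy": (35, 16),
    "Osteopathy": (34, 14),
    "Placebo": (40, 16),
}

pain_flags = {g: sd_exceeds_half_mean(m, s) for g, (m, s) in duration_of_pain_weeks.items()}
age_flags = {g: sd_exceeds_half_mean(m, s) for g, (m, s) in age_years.items()}

print(pain_flags)
print(age_flags)
```

The rule flags duration of pain in all three groups but not age. As the text notes, failing to trigger the rule does not guarantee symmetry; a histogram of the raw data remains the quicker check.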

Table 3.5 Details of patients in each treatment group in a study of low back pain (Gibson et al., 1985)

| Treatment group | Short-wave diathermy | Osteopathy | Placebo |
|---|---|---|---|
| Number of patients | 34 | 41 | 34 |
| Sex | 16F/18M | 21F/20M | 11F/23M |
| Mean age (SD) | 35 (16) | 34 (14) | 40 (16) |
| Mean duration of pain in weeks (SD) | 18 (11) | 16 (14) | 17 (11) |
| Median pain score at presentation (range)* | 45 (5-82) | 35 (4-90) | 48 (10-96) |
| Radiological abnormalities of the spine | 12 (34%) | 12 (29%) | 11 (32%) |

*Visual analogue scale

Sometimes we wish to show graphically the distribution of a continuous variable in two or more groups. This can be done by means of a separate histogram for each group, these being aligned vertically, but there is a rather clearer format that shows all the observations. Figure 3.14 shows the distribution of uric acid in a group of women before, during and after pregnancy (Lind et al., 1984). All the data are shown in the graph, and the authors have also given the mean, standard deviation and number of observations at each stage. This informative figure thus effectively incorporates a table while using little extra space. Bar diagrams are often used to show means and standard deviations in each group. This is not a good format: this information is better in a table, or else a more informative display, such as that in Figure 3.14 or a box-and-whisker diagram, should be used.


Figure 3.14 Distribution of serum uric acid in a group of healthy women before, during and after pregnancy (reproduced from Lind et al., 1984, with permission).

3.5.2 Relation between two continuous variables

The relation between two continuous variables may be shown graphically in a scatter diagram. This is a simple graph in which the values of one variable are plotted against those of the other. For example, Figure 3.15 shows a scatter diagram of the PImax data of Table 3.1 related to age. Scatter diagrams are very simple to produce using statistical computer programs. When there are two (or more) individuals with identical values of both variables this should be shown, preferably by moving one point slightly. Some software packages print the actual number of coincident points up to 9, so that '9' means '9 or more'. It is easy to indicate subgroupings by using different plotting symbols. For example, in Figure 3.15 males and females could have been indicated by closed and open circles. The scatter diagram is a very useful descriptive tool, and is often valuable as a prelude to formal statistical analysis. The graph in Figure 3.14 is really a scatter diagram relating a continuous and a categorical variable.


Figure 3.15 Scatter diagram of PImax by age.

3.6 THE EFFECT OF TRANSFORMING THE DATA

If we change our data in some way we will inevitably change the mean and standard deviation too. In some situations we alter, or transform, a complete set of data, in which case the effect on the mean and standard deviation may be predicted.

The simplest case to consider is where we alter the units of measurement. If we change the IgM data from values recorded as g/l to mg/l each observation will be 1000 times as large. It is easy to see that the mean will also be 1000 times bigger, and inspection of the formula for the standard deviation shows that it too will be 1000 times bigger. In contrast, if we add or subtract a constant value from all the observations, the mean of the new data is obtained by the same subtraction or addition but the standard deviation is unaffected. Thus to the mean of a set of temperatures recorded as degrees Celsius we must add 273.15 to give the mean of the equivalent thermodynamic temperature on the Kelvin scale.

Any transformation based on multiplication, division, subtraction or addition is called a linear transformation, because if we plot the new values against the original values we get a straight line. The mean and standard deviation of the transformed values are obtained in a simple manner. For other, non-linear transformations, however, we cannot obtain the mean and standard deviation of the transformed data in this way. Examples of non-linear transformation are taking logarithms (illustrated in section 3.4.4) or square roots. Thus the mean of the log data is not the same as the log of the mean of the raw data. The reasons for transforming data are considered in Chapter 7.
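These properties are easy to verify numerically. The sketch below uses hypothetical temperatures; the variable names are assumptions made for illustration.

```python
import math
import statistics

celsius = [36.5, 37.0, 37.2, 38.1, 36.8]   # hypothetical temperatures
kelvin = [c + 273.15 for c in celsius]      # adding a constant
scaled = [1000 * c for c in celsius]        # multiplying by a constant

mean_c = statistics.mean(celsius)
sd_c = statistics.stdev(celsius)

# Adding a constant shifts the mean and leaves the SD unchanged.
shift_ok = (math.isclose(statistics.mean(kelvin), mean_c + 273.15)
            and math.isclose(statistics.stdev(kelvin), sd_c))

# Multiplying by a constant multiplies both the mean and the SD.
scale_ok = (math.isclose(statistics.mean(scaled), 1000 * mean_c)
            and math.isclose(statistics.stdev(scaled), 1000 * sd_c))

# A non-linear transformation does not commute with taking the mean.
mean_of_logs = statistics.mean([math.log10(c) for c in celsius])
log_of_mean = math.log10(mean_c)

print(shift_ok, scale_ok)
print(mean_of_logs, log_of_mean)
```

For positive data that are not all equal, the mean of the logs is always smaller than the log of the mean, which is the same fact as the geometric mean being smaller than the arithmetic mean.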

3.7 DATA PRESENTATION

3.7.1 Numerical presentation

Data summary should not be by the mean (or median) alone, but some indication of variability should also be provided. It is common to put the SD in brackets after the mean. When these values are quoted in text the format mean $\pm$ SD, in which a mean diastolic blood pressure would be followed by '$\pm$ 11.9', should be avoided. (Indeed several medical journals no longer allow this notation.) It is much better to follow the mean with '(SD 11.9)' because this format makes it clear what the second number is and also avoids the implication that the range of values from mean $-$ SD to mean $+$ SD is of specific importance. As we have seen, it is the range mean $\pm$ 2SD which can often be used to describe the spread of the large majority (about 95%) of a set of observations.

It is not possible to give absolute rules for numerical presentation, but the following guidelines will generally be reasonable. It is usually appropriate to quote the mean to one extra decimal place compared with the raw data. The mean should not be presented to ridiculous (and spurious) 'accuracy'. For example, it is clearly absurd to quote the mean length of gestation of a group of babies to the nearest 10 minutes. This is done when quoting weeks of gestation to 3 decimal places. The standard deviation should usually be given to the same accuracy as the mean, or with one extra decimal place.

3.7.2 Tables

Whether or not to put descriptive data in tables will depend on the number of variables and groups of subjects. Table 3.5 shows a recommended way of presenting descriptive data, both continuous and categorical. In general it is preferable to put data of a like kind in columns rather than rows as the eye can scan columns more easily, but this is not always possible. For example, in Table 3.5 the means of the same variables in the three treatment groups are shown in rows, as it is usually more natural that way. However, means and SDs are clearly distinguished side by side, with the latter in brackets for clarity.

Tables can also be used to show raw data, although this is only reasonable when there are not too many observations. Where possible, it is helpful to order the data by one of the variables - after all, there is usually nothing special about the order in which the patients were seen. Many of the tables in this book, such as Table 3.1, have been ordered in this way.

3.7.3 Graphs

It is difficult to offer much general advice about when it is appropriate to use a graph rather than a table. Graphs offer the opportunity to show much more data than could be shown in a table, and are thus probably most suited to data that cannot easily be displayed in a table. There is no point in using a graph to show, for example, the means and standard deviations of one variable in two or three groups. Some displays, such as histograms, are in essence graphical - Figure 3.3 is a much clearer display than Table 3.2. It is possible to combine the best features of a table and a figure, and an example was given in Figure 3.14. This form of display should be used more often.

Scatter diagrams are particularly useful for showing the relation between two variables. It is important that all the data points should be shown, which can pose difficulties when there are coincident points (see section 6.7). Different symbols can be used to indicate subgroups of the data.

Graphs are a very powerful way of getting a message across, but the same data can be portrayed in many ways, with a variety of visual effects. For example, Figure 3.16 shows three alternative displays of the data in Table 3.6 showing average amounts of bread consumed per person per week in London from 1960 to 1980. Features visible in one or more figures include a gradual reduction in total bread consumption, a more than proportionate fall in consumption of white bread, and a rise in consumption of brown and wholemeal bread in the last five year period. These features are probably more easily seen in Table 3.6.

Table 3.6 Amounts of bread consumed in London from 1960 to 1980 (g per person per week) (Sivell and Wenlock, 1983)

| Type of bread | 1960 | 1965 | 1970 | 1975 | 1980 |
|---|---|---|---|---|---|
| White | 1040 | 975 | 915 | 785 | 620 |
| Brown | 70 | 80 | 70 | 75 | 115 |
| Wholemeal | 25 | 20 | 15 | 20 | 45 |
| Other | 155 | 80 | 85 | 75 | 105 |
| Total | 1290 | 1155 | 1080 | 955 | 880 |

An excellent book on graphical methods in general is that by Tufte (1983), and graphs for statistics are discussed by Moses (1987). Many innovative ideas for descriptive methods are described by Tukey (1977).

EXERCISES

3.1 The table below shows some data for 65 patients with rheumatoid arthritis treated with sodium aurothiomalate (SA) (Ayesh et al., 1987). The total dose of SA is shown, together with values of the sulphoxidation index (SI), which measures the capacity to convert organic divalent alkyl sulphide to its corresponding sulphoxide form. The patients have been separated into 28 without and 37 with major adverse reactions to the drug.

(a) Some values of SI are given as >80.0. What is the name given to observations like this?

(b) What is the difficulty about drawing histograms of SI in each group? What shape are the distributions?

(c) Give two reasons why it is preferable to calculate the median rather than the mean to describe the average SI value.

(d) Obtain the median SI for each group of patients. (This should take less than ten seconds.)

(e) Obtain the median total dose of SA for the group with adverse reactions.

(f) Produce stem-and-leaf diagrams to compare the age distributions in the two groups.

(g) Do the data support the idea that patients experiencing adverse reactions were on average older than those without adverse reactions?

Without adverse reactions

| | Age | Total dose of SA (mg) | SI |
|---|---|---|---|
| 1 | 44 | 1560 | 1.0 |
| 2 | 65 | 1310 | 1.2 |
| 3 | 58 | 850 | 1.2 |
| 4 | 57 | 1250 | 1.7 |
| 5 | 51 | 950 | 1.8 |
| 6 | 64 | 850 | 1.8 |
| 7 | 33 | 1200 | 1.9 |
| 8 | 61 | 1390 | 2.0 |
| 9 | 49 | 1450 | 2.3 |
| 10 | 67 | 3300 | 2.8 |
| 11 | 39 | 2760 | 2.8 |
| 12 | 42 | 860 | 3.4 |
| 13 | 35 | 1810 | 3.4 |
| 14 | 31 | 1310 | 3.8 |
| 15 | 37 | 1250 | 3.8 |
| 16 | 43 | 1210 | 4.2 |
| 17 | 39 | 1460 | 4.9 |
| 18 | 53 | 2310 | 5.4 |
| 19 | 44 | 1360 | 5.9 |
| 20 | 41 | 1910 | 6.2 |
| 21 | 72 | 910 | 12.0 |
| 22 | 61 | 1410 | 18.8 |
| 23 | 48 | 2460 | 47.0 |
| 24 | 59 | 1350 | 70.0 |
| 25 | 72 | 810 | >80.0 |
| 26 | 59 | 1460 | >80.0 |
| 27 | 71 | 760 | >80.0 |
| 28 | 53 | 910 | >80.0 |

With adverse reactions

| | Age | Total dose of SA (mg) | SI |
|---|---|---|---|
| 1 | 53 | 360 | 2.0 |
| 2 | 74 | 2010 | 2.0 |
| 3 | 29 | 1390 | 2.0 |
| 4 | 53 | 660 | 3.0 |
| 5 | 67 | 1135 | 3.5 |
| 6 | 67 | 510 | 5.3 |
| 7 | 54 | 410 | 5.7 |
| 8 | 51 | 910 | 6.5 |
| 9 | 57 | 360 | 13.0 |
| 10 | 62 | 1260 | 13.0 |
| 11 | 51 | 560 | 13.9 |
| 12 | 68 | 1135 | 14.7 |
| 13 | 50 | 1410 | 15.4 |
| 14 | 38 | 1110 | 15.7 |
| 15 | 61 | 960 | 16.6 |
| 16 | 59 | 1310 | 16.6 |
| 17 | 68 | 910 | 16.6 |
| 18 | 44 | 1235 | 22.0 |
| 19 | 57 | 2950 | 22.3 |
| 20 | 49 | 360 | 33.2 |
| 21 | 49 | 1935 | 47.0 |
| 22 | 63 | 1660 | 61.0 |
| 23 | 29 | 435 | 65.0 |
| 24 | 53 | 310 | 65.0 |
| 25 | 53 | 310 | >80.0 |
| 26 | 49 | 410 | >80.0 |
| 27 | 42 | 690 | >80.0 |
| 28 | 44 | 910 | >80.0 |
| 29 | 59 | 1260 | >80.0 |
| 30 | 51 | 1260 | >80.0 |
| 31 | 46 | 1310 | >80.0 |
| 32 | 46 | 1350 | >80.0 |
| 33 | 41 | 1410 | >80.0 |
| 34 | 39 | 1460 | >80.0 |
| 35 | 62 | 1535 | >80.0 |
| 36 | 49 | 1560 | >80.0 |
| 37 | 53 | 2050 | >80.0 |

3.2 (a) Does Figure 3.1 indicate that professional pilots are more likely to have an aviation accident than other groups?

The following table shows the data that were plotted in Figure 3.1, together with the aviation accident rates per 100000 hours of recent flight time (Booze, 1977).

| | Number of accidents | Rate per 1000* | Rate per 100000 hr |
|---|---|---|---|
| Professional pilots | 1302 | 15.9 | 0.2 |
| Lawyers | 57 | 11.0 | 1.5 |
| Farmers | 166 | 10.1 | 1.3 |
| Sales representatives | 137 | 9.0 | 1.2 |
| Physicians | 76 | 8.7 | 1.8 |
| Mechanics and repairmen | 44 | 6.9 | 1.5 |
| Policemen and detectives | 48 | 6.6 | 1.8 |
| Managers and administrators | 643 | 6.0 | 0.7 |
| Engineers | 125 | 4.7 | 1.1 |
| Teachers | 43 | 4.2 | 1.1 |
| Housewives | 29 | 3.7 | 3.2 |
| Academic students | 188 | 3.2 | 3.7 |
| Armed Forces members | 111 | 1.6 | 0.7 |

*in the specified occupation

(b) The rates per 100000 hours can also be made into a bar diagram. From such a diagram, or from the figures shown in the table, which two groups of pilots had most accidents? Why do the two sets of figures give different answers? (A scatter diagram is useful to see the relation between the two.)

3.3 Calculate the centiles used to construct the box-and-whisker plot in Figure 3.12 using the method of calculation given in section 3.4.2.

4 Theoretical distributions

4.1 INTRODUCTION

The importance of variability in attributes or responses was emphasized in the previous chapter. Without such variability events would be entirely predictable, and there would be no need for statistical methods. Because there is variability, we need statistical analysis to unravel what is going on. For example, while it is now universally accepted that cigarette smoking is hazardous to health, realization that this was so did not come until much careful research was carried out beginning in the 1940s and 1950s (Doll and Hill, 1950). Although the risk of heart disease, lung cancer and other diseases is considerably increased by smoking, the effect was masked because the response to smoking is highly variable. Some heavy smokers live to 80 or 90, whereas many non-smokers die before they are 60. Clearly the ability to detect effects, whether in observational or experimental studies, depends upon both the magnitude of the effect on average, and the variability of the effect. We will see that the balance between these ideas is behind a large number of the main statistical methods.

Another essential concept in the application of statistical methods is that of probability. We frequently encounter probability in some form in everyday life. It may be reasonably explicit, such as the probability of winning a lottery, or implicit, such as the probability of crossing the road without getting run over. Often we need to judge probability in relation to a decision that has to be taken, for example, whether I take an umbrella when I go out will depend on my perception of the probability of rain. Most aspects of life can be shown to involve some probabilities, and medicine is no exception. What is the probability of a heart transplant patient living for two years? What is the probability that a patient will respond to a particular treatment? What is the probability that a patient with a pain in his stomach has an ulcer? Given appropriate data, statistical methods help to answer many questions like these. It must be remembered, though, that statistical analysis rarely leads to a definite answer, so that we should indicate (or at least be aware of) a degree of uncertainty in our answers.

4.2 PROBABILITY

First, we need to consider the mathematical nature of probability. For the purposes of the statistical methods described in this book I shall define the probability of some specific outcome as the proportion of times that that outcome would occur if we repeated the experiment or observation a large number of times. For example, we can estimate the probability that a baby is a boy by observing what proportion of a large number of babies are boys.
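This 'long-run proportion' definition can be illustrated by simulation. The sketch below is only an illustration: the event is a hypothetical one with an assumed true probability of 0.5 (not a figure from the text), and the function names are assumptions.

```python
import random

def estimate_probability(trial, n_trials, seed=0):
    """Estimate a probability as the proportion of n_trials in which trial() is True."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if trial(rng))
    return hits / n_trials

def hypothetical_event(rng):
    """A hypothetical outcome with an assumed true probability of 0.5."""
    return rng.random() < 0.5

for n in (100, 10_000, 100_000):
    print(n, estimate_probability(hypothetical_event, n))

estimate = estimate_probability(hypothetical_event, 100_000)
```

The estimates settle closer to the assumed value as the number of repetitions grows, which is why a probability estimated from a small number of observations should be quoted with some indication of its uncertainty.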

By definition a probability lies between 0 and 1; something that cannot happen has a probability of 0, while something that is certain to happen has a probability of 1. A probability is thus somewhat similar to a proportion or a percentage: an outcome with a probability of 0.2 means that there is a one in five, or a 20%, chance of it happening. Probabilities are not usually expressed as a percentage. In practice we have to estimate most probabilities, as there is no way of knowing the true value.

There are two simple rules regarding probabilities that we need to consider at this stage:

1. For a given event, for any two outcomes that might happen the probability of either occurring is the sum of the individual probabilities.

For example, if the probability of an individual being blood group A is 0.43 and of being group B is 0.08, then the probability of being either A or B is 0.51. It follows that the probabilities of all possible outcomes must add up to 1, since one of these possibilities must occur. For example, the probabilities of being in the different blood groups are approximately

O: 0.46; A: 0.43; B: 0.08; AB: 0.03.

We assume here that all outcomes are mutually exclusive.

2. If we consider two or more different events which are independent of each other, then to get the probability of a combination of specific outcomes for each of the events we must multiply the individual probabilities of those outcomes.

The idea of independence is an essential statistical concept. By independent we mean that if we know the outcome of one event this tells us nothing about the other event. More formally, the probability of each possible outcome for the second event is the same regardless of the outcome for the first event, and so on. For example, if there are three people in a GP's waiting room, the probability that they are all blood group O is $0.46 \times 0.46 \times 0.46 = 0.097$, that is, there is less than one chance in ten. In this context independence requires the three people to be unrelated.
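Both rules can be written out in a few lines. The sketch below uses the approximate blood group probabilities quoted in the text; the function names are hypothetical.

```python
import math

# Approximate blood group probabilities quoted in the text.
blood_group = {"O": 0.46, "A": 0.43, "B": 0.08, "AB": 0.03}

def prob_any(outcomes, dist):
    """Addition rule: probability of any one of several mutually exclusive outcomes."""
    return sum(dist[o] for o in set(outcomes))

def prob_all_independent(probs):
    """Multiplication rule: probability that independent events all have the given outcomes."""
    return math.prod(probs)

p_a_or_b = prob_any(["A", "B"], blood_group)               # 0.43 + 0.08
p_total = prob_any(blood_group.keys(), blood_group)        # all outcomes together
p_three_o = prob_all_independent([blood_group["O"]] * 3)   # 0.46 * 0.46 * 0.46

print(p_a_or_b, p_total, p_three_o)
```

The multiplication is valid only under the independence assumption stated in the text; for related people the probabilities cannot simply be multiplied.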

As we would expect, if two events are not independent, the multiplicative property does not apply. For example, if the probability of a man being more than six feet tall is 0.2, the probability that both he and his son are over six feet is not $0.2 \times 0.2 = 0.04$ because the heights of children tend to be related to the heights of their parents. This idea is used in reverse in cases of uncertainty to investigate whether two events are independent. For example, in a case-control study patients with a disease (cases) are compared with people without the disease (controls) with respect to some possibly hazardous exposure earlier in their life. Women with cervical cancer may be compared with controls with respect to past use of oral contraceptives. If more cases had the exposure than controls then the probability of having been exposed is different for cases and controls and one suspects the exposure as a cause of the disease. Another way of looking at this is to say that having the disease and having had the exposure are not independent events.

4.3 样本与总体 4.3 SAMPLES AND POPULATIONS

几乎所有统计分析都基于这样一个原则:通过对样本个体收集数据,利用这些信息对所有此类个体做出推断。这一思想在民意调查中最为常见。所有研究对象(或被调查对象)的集合称为感兴趣的总体。上一章中,展示了25名囊性纤维化患者和298名6个月至6岁正常儿童的数据。分析这些样本数据使我们能够对总体做出推断。对于这些研究,感兴趣的总体分别是所有囊性纤维化患者和所有6个月至6岁的儿童。样本的选择方式显然非常重要,下一章将讨论。
Nearly all statistical analysis is based on the principle that one acquires data on a sample of individuals and uses the information to make inferences about all such individuals. This idea is probably most familiar in the context of opinion polls. The set of all subjects (or whatever is being investigated) is called the population of interest. In the previous chapter data were presented for 25 patients with cystic fibrosis and for 298 normal children aged 6 months to 6 years. Analysing the data from these samples enables us to make inferences about the population. For these studies the populations of interest were, respectively, all patients with cystic fibrosis and all children aged 6 months to 6 years. The way the sample is selected is clearly very important, and is discussed in the next chapter.

我们采样进行研究是因为几乎不可能研究整个总体。我们或许能研究某国某一特定日期诊断为囊性纤维化的所有患者,但他们仍只是囊性纤维化患者总体的一个样本,受时间和地域限制,且未诊断病例被排除。幸运的是,我们不需要研究整个总体,因为精心选择的样本可以提供可靠的答案。我们通常无法计数或识别总体的所有成员,但样本允许我们对总体做出集体和个体的推断。例如,一项新药抗高血压效果的研究可以(在一定范围内)估计该药对未来未参加研究的高血压患者的潜在益处。
We take samples to study because it is rarely, if ever, possible to study the whole population. We might be able to study all patients diagnosed as having cystic fibrosis in one country on a particular date, but they are still only a sample of all people with cystic fibrosis, restricted by time and geography, and undiagnosed cases are excluded. Fortunately we do not need to study the whole population, as a carefully chosen sample can yield reliable answers. We cannot usually count or identify all the members of the population, but the sample allows us to draw inferences about the population, both collectively and individually. For example, a study of the antihypertensive effect of a new drug would allow us to estimate (within limits) the possible benefit of the drug to future hypertensive patients not in the study.

样本与总体之间的关系存在不确定性,我们用概率的概念来表示这种不确定性。理论概率分布的概念在此非常重要。
The relation between sample and population is subject to uncertainty, and we use ideas of probability to indicate this uncertainty. The idea of a theoretical probability distribution is important in this context.

4.4 概率分布 4.4 PROBABILITY DISTRIBUTIONS

在上一章中,我讨论了观察数据的分布这一概念,即经验分布。许多统计方法使用相关的概率分布概念,该分布通过数学方式加以描述。概率分布用于计算不同数值出现的理论概率,因此是经验相对频率分布的理论对应物。
In the previous chapter I discussed the idea of a distribution of observed data, an empirical distribution. Many statistical methods use the related idea of a probability distribution, which is specified mathematically. A probability distribution is used to calculate the theoretical probability of different values occurring, and is thus a theoretical equivalent of an empirical relative frequency distribution.

例如,如果我们知道成年男性身高的均值和标准差,且假设总体身高分布符合某一特定概率分布,就可以计算身高超过六英尺的概率。如果观察到男婴比例为0.52,我们可以利用这一事实和数学方法计算一位有四个孩子的女性恰好有四个女儿的概率。0.52是概率分布的一个参数,均值和标准差也是概率分布的参数。所有概率分布均由一个或多个参数描述。
For example, if we know the mean and standard deviation of the height of adult men we can calculate the probability of being more than six feet tall if we assume that the distribution of height in the population is the same as a particular probability distribution. If we know from observation that the proportion of babies that are boys is 0.52, we can use this fact together with some mathematics to find the probability that a woman with four children has four daughters. The value 0.52 is a parameter of the probability distribution, as are the mean and standard deviation in the first example. All probability distributions are described by one or more parameters.

许多统计方法基于这样一个假设:观察数据是来自具有已知理论分布形式的总体的样本。如果这一假设合理(虽然我们无法确定其真实性),统计分析方法的使用就简单且范围广泛。若分布假设不合理而我们仍按此假设进行分析,可能会得到误导性(无效)的结果。分析数据时,我们可以选择基于分布假设的方法,称为参数方法,或不作分布假设的方法,称为无分布假设或非参数方法。
Many statistical methods are based on the assumption that the observed data are a sample from a population with a distribution that has a known theoretical form. If this assumption is reasonable (we cannot establish if it is true) then the statistical methods of analysis are simple to use and wide-ranging. If the distributional assumption is not reasonable and we proceed as if it were, then we may end up with misleading (and invalid) answers. When analysing data we have a choice between methods that make distributional assumptions, called parametric methods, and those which make no assumptions about distributions, called distribution-free or non-parametric methods.

概率分布在统计分析中的重要性反映了参数方法的主导地位。首先,我将考虑连续变量的概率分布,其中一种分布—正态分布,具有根本性的意义。随后,我将讨论离散数据的概率分布。
The importance of probability distributions in statistical analysis reflects the dominance of parametric methods. First I shall consider probability distributions for continuous variables, for which one distribution in particular, the Normal distribution, is of fundamental importance. Later I shall look at probability distributions for discrete data.

4.5 正态分布 4.5 THE NORMAL DISTRIBUTION

正态分布是统计学中迄今为止最重要的概率分布。它以某种形式出现在以下大多数章节中,原因将在第8章中更详细地讨论,因此理解其性质和作用至关重要。然而,为了强调这并不意味着该分布比其他许多分布更“正常”,我使用大写字母N表示正态分布。(它有时也被称为高斯分布,以数学家高斯的名字命名。)
The Normal distribution is by far the most important probability distribution in statistics. It appears in some form in most of the following chapters, for reasons which are considered more fully in Chapter 8, so an understanding of its nature and role is essential. However, to emphasize that there is no implication that this distribution is more 'normal' than many others, I use a capital N for Normal. (It is also sometimes known as the Gaussian distribution, after the mathematician Gauss.)

在上一章中,我展示了如何用直方图来描绘一组连续变量观测值的分布。如果有成千上万的观测值,且IgM的测量更精确,IgM值可以被划分为许多极小的区间,数据的直方图看起来会更像一条平滑的曲线。因此,不难想象某些观测数据的直方图或频率多边形是某种“潜在”平滑频率分布的近似。例如,图4.1显示了216名原发性胆汁性肝硬化患者血清白蛋白值的直方图,图4.2则显示了同一数据的频率多边形图,其中效果更为明显。
In the previous chapter I showed how a histogram can be used to depict the distribution of a set of observations of a continuous variable. If there had been thousands of observations, and IgM had been recorded more precisely, the IgM values could be divided into many tiny intervals, and a histogram of the data would appear more like a smooth curve. So it is not difficult to imagine that the histogram or frequency polygon of some observed data is an approximation to some 'underlying' smooth frequency distribution. For example, Figure 4.1 shows a histogram of serum albumin values in 216 patients with primary biliary cirrhosis, and Figure 4.2 shows a frequency polygon of the same data, in which the effect is rather clearer.

图4.1 216名原发性胆汁性肝硬化患者血清白蛋白值的直方图(摘自Christensen等人,1985年的研究)
Figure 4.1 Histogram of serum albumin values in 216 patients with primary biliary cirrhosis (from the study by Christensen et al., 1985)

图4.2 216名原发性胆汁性肝硬化患者血清白蛋白值的频率多边形图。
Figure 4.2 Frequency polygon of serum albumin values in 216 patients with primary biliary cirrhosis.

连续测量的频率分布,如图4.2所示,通常只有一个峰值:称为单峰分布。它们可能相当对称,如此处,或者不对称,如第3章中讨论的IgM数据。正态分布是一种单峰且对称的概率分布;其形状见图4.3。偶尔会见到双峰频率分布,称为双峰分布,通常是不同均值的亚组混合造成的。
Frequency distributions for continuous measurements, such as in Figure 4.2, tend to have a single peak: they are called unimodal. They may be fairly symmetric, as here, or asymmetric, as with the IgM data discussed in Chapter 3. The Normal distribution is a probability distribution which is unimodal and symmetric; its shape is shown in Figure 4.3. Frequency distributions with two peaks are occasionally seen. These are called bimodal, and are usually the result of mixing subgroups with different means.


图4.3 正态分布。
Figure 4.3 The Normal distribution.

在考虑如何利用正态分布之前,有一些关于连续概率分布的一般性说明。首先,它们通常没有上限,有些甚至没有下限。理论上,正态分布从负无穷大($-\infty$)延伸到正无穷大($+\infty$)。其次,频率曲线的高度,即概率密度,不能被视为某一特定值的概率。因为连续变量可能的取值无限多,任何具体值的概率都是零。曲线高度本身无实际意义;其数值由曲线下的总面积恒为1决定。与观察数据的直方图类似,我们通过考虑对应某一特定区间的面积来使用概率分布。由于总面积为1,该面积即对应该区间内取值的概率。举个简单例子,正态分布均值左侧的面积为0.5(因对称性),这就是小于均值的概率。
Before considering how we make use of the Normal distribution there are some general points to note about continuous probability distributions. First, they usually have no upper limit and some have no lower limit either. In theory the Normal distribution extends from minus infinity ($-\infty$) to plus infinity ($+\infty$). Second, the height of the frequency curve, which is called the probability density, cannot be taken as the probability of a particular value. This is because for a continuous variable there are infinitely many possible values, so that the probability of any specific value is zero. The height of a curve is not of any practical use; its value is determined by the fact that the total area under the curve is always taken to be 1. As with histograms of observed data, we use a probability distribution by considering the area corresponding to a particular restricted range of values. Because the total area is 1 this area corresponds to the probability of those values. To take a simple example, the area to the left of the mean of the Normal distribution is 0.5 (because of the symmetry) and this is the probability of being below the mean.

4.5.1 正态分布的应用 4.5.1 Using the Normal distribution

正态分布的数学表达式较为复杂,但我们无需直接处理它即可使用正态分布,因为相关信息已在表格中提供。然而,重要的是要知道正态分布完全由两个参数描述,即均值和标准差,通常分别用 $\mu$(mu)和 $\sigma$(sigma)表示。图4.4(a)展示了正态分布与这两个参数的关系。无论均值和标准差取何值,正态分布总是如图4.4(a)所示与均值和标准差相关。图4.4(b)和4.4(c)进一步说明了这一点,分别展示了均值为10、标准差为2和均值为125、标准差为8的正态分布。图4.5显示了图4.1中血清白蛋白的直方图及其对应均值和标准差的正态分布,两者明显非常相似。
The mathematical equation of the Normal distribution is unpleasantly complicated, but we do not need to deal with it in order to use the Normal distribution, because the necessary information is readily available in tables. However, it is important to know that the Normal distribution is completely described by two parameters, the mean and the standard deviation. These are usually called $\mu$ (mu) and $\sigma$ (sigma) respectively. Figure 4.4(a) shows the Normal distribution in relation to these parameters. Whatever values the mean and standard deviation have, the Normal distribution is related to the mean and standard deviation in the manner shown in Figure 4.4(a). This feature is illustrated by Figures 4.4(b) and 4.4(c), which show Normal distributions with firstly mean 10 and standard deviation 2 and secondly mean 125 and standard deviation 8. Figure 4.5 shows the histogram of serum albumin shown in Figure 4.1 and the Normal distribution with the same mean and standard deviation. The two are clearly very similar.

如图4.4(a)所示,横轴上的任意位置都可以表示为距离均值若干标准差(正或负)的距离。该距离称为标准正态偏差或正态分数。它相当于观察一个均值为0、标准差为1的正态分布,这种特殊的正态分布称为标准正态分布。任何正态分布都可以通过减去均值再除以标准差转换(或标准化)为标准正态分布。
As Figure 4.4(a) shows, any position along the horizontal axis can be expressed as a distance of a number of standard deviations (negative or positive) from the mean. This distance is known as a standard Normal deviate or Normal score. It is equivalent to looking at a Normal distribution with a mean of 0 and a standard deviation of 1, a special Normal distribution known as the standard Normal distribution. Any Normal distribution can be converted (or transformed) into a standard Normal distribution by subtracting the mean and dividing by the standard deviation.
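The standardization step is simple enough to express as a one-line function. This is a minimal sketch in Python; the function name `standard_normal_deviate` is my own choice, and the example values are read off the two distributions of Figures 4.4(b) and 4.4(c).

```python
def standard_normal_deviate(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd

# Figure 4.4(b): mean 10, SD 2.  Figure 4.4(c): mean 125, SD 8.
print(standard_normal_deviate(14, mean=10, sd=2))    # two SDs above the mean
print(standard_normal_deviate(117, mean=125, sd=8))  # one SD below the mean
```

Values from quite different Normal distributions become directly comparable once they are expressed on this common scale.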

我们使用正态分布的一种方式如下。当一组观察值的分布与正态分布相似时,我们假设总体中该变量的分布实际上是正态的,并基于此进行计算。例如,如果我们愿意假设所有原发性胆汁性肝硬化患者的血清白蛋白在总体中呈正态分布,就可以计算某患者血清白蛋白水平高于 42 g/l 的概率。
One way that we use the Normal distribution is as follows. When a set of observations has a distribution that is similar to a Normal distribution we assume that in the population the distribution of the variable actually is Normal, and carry out calculations on this basis. For example, we can calculate the probability that a patient with primary biliary cirrhosis has a serum albumin level greater than 42 g/l if we are willing to assume that, among the population of all patients with primary biliary cirrhosis, serum albumin has a Normal distribution.

附录B中的表B1显示了标准正态分布的下尾面积。下尾指的是曲线下从 $-\infty$ 到感兴趣值的面积。该面积等同于取值低于指定值的概率。这个概念也可以用累积相对频率分布来表达,如图4.6所示。表B1只是图4.6曲线的更精确版本。
Table B1 in Appendix B shows the lower tail areas of the standard Normal distribution. The lower tail means the area under the curve from $-\infty$ up to the value of interest. This area is equivalent to the probability of a value lower than the specified value. This idea can also be expressed as the cumulative relative frequency distribution, which is shown in Figure 4.6. Table B1 is simply a more accurate version of the curve in Figure 4.6.

图4.5 显示了216个血清白蛋白值的直方图及其相同均值和标准差的正态分布曲线。
Figure 4.5 Histogram of 216 serum albumin values and the Normal distribution with the same mean and standard deviation.

图4.6 显示了拟合血清白蛋白数据的累积正态分布曲线。
Figure 4.6 Cumulative Normal distribution fitted to serum albumin data.

低于 $-1$ 的面积为0.16,低于 $+1$ 的面积为0.84,因此对应范围 $-1$ 到 $+1$ 的面积为 $0.84 - 0.16 = 0.68$。换言之,对于完全正态分布的数据,有0.68的概率落在均值一个标准差范围内。对其他标准差倍数重复此计算,得到:
The area below $-1$ is 0.16 and the area below $+1$ is 0.84, so that the area corresponding to the range $-1$ to $+1$ is $0.84 - 0.16 = 0.68$. In other words, for data with an exactly Normal distribution there is a probability of 0.68 of being within one standard deviation of the mean. Repeating these calculations for other numbers of standard deviations we get

| 范围 Range | 在范围内的概率 Probability of being within range | 在范围外的概率 Probability of being outside range |
| --- | --- | --- |
| 均值 mean ± 1 SD | 0.683 | 0.317 |
| 均值 mean ± 2 SD | 0.954 | 0.046 |
| 均值 mean ± 3 SD | 0.9973 | 0.0027 |

这些值也可从表B2获得。每种情况下,不在指定范围内的概率等于1减去在范围内的概率。我们看到,正态分布值超过均值正负三倍标准差的概率极小(0.0027,即0.27%,约为400分之一),这与图4.4的直观印象一致。当然,在非常大的样本中,我们仍会期望出现几个如此极端的值。
These values can also be obtained from Table B2. In each case the probability of not being within the stated range is 1 minus the probability of being within the range. We can see that there is a minimal chance (0.0027, or 0.27%, about 1 in 400) that a value from a Normal distribution will be more than three standard deviations above or below the mean, agreeing with the visual impression gained from Figure 4.4. Of course, in very large samples we would expect several values to be this extreme.

落在均值正负两倍标准差范围内的概率略高于0.95。换言之,约95%的正态分布观测值会落在均值 $-2$SD 到均值 $+2$SD 的范围内,这与上一章的更一般性陈述相符。正如我们稍后将看到的,正态分布曲线下恰好95%的面积实际上落在稍窄的范围均值 $\pm 1.96$SD 内。
The probability of being within two standard deviations of the mean is just over 0.95. In other words, about 95% of observations from a Normal distribution will be within the range mean $-2$SD to mean $+2$SD, which agrees with the more general statement in the previous chapter. As we will see later, exactly 95% of the area under the Normal distribution curve actually falls in the slightly narrower range of mean $\pm 1.96$SD.
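Where tables are not to hand, the same areas can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as a stand-in for Tables B1 and B2; the helper name `prob_within` is my own.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # the standard Normal distribution

def prob_within(k):
    """Probability of lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    inside = prob_within(k)
    print(f"mean ± {k} SD: within {inside:.4f}, outside {1 - inside:.4f}")

# Multiplier that encloses exactly 95% of the distribution (2.5% in each tail).
z_95 = std_normal.inv_cdf(0.975)
print(f"exact 95% range: mean ± {z_95:.2f} SD")
```

The last line recovers the multiplier 1.96 mentioned above, which is why the range mean ± 2 SD encloses slightly more than 95%.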

4.5.2 一个例子 4.5.2 An example

回到血清白蛋白数据,我们可以计算在假设真实分布为正态分布的前提下,数值大于42.0的概率。血清白蛋白的平均水平为 34.46 g/l,标准差为 5.84 g/l。我们首先计算 42.0 g/l 离均值多少个标准差,计算公式为
Returning to the serum albumin data, we can calculate the probability of a value being above 42.0 on the assumption that the true distribution is Normal. The mean serum albumin level was 34.46 g/l and the standard deviation was 5.84 g/l. We first calculate how many standard deviations from the mean the value of 42.0 g/l is, which is given by

$$\frac{42.0 - 34.46}{5.84} = 1.29.$$

从表B1中查得,大于1.29的概率为0.0985,因此数值大于 42.0 g/l 的概率约为 0.10。
From Table B1 we find that the probability of being greater than 1.29 is 0.0985, so the probability of a value above 42.0 g/l is about 0.10.
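The same calculation can be sketched in Python. I assume here a mean of 34.46 g/l and a standard deviation of 5.84 g/l, rounded values that are consistent with the central ranges tabulated later in this section; `NormalDist` again stands in for Table B1.

```python
from statistics import NormalDist

MEAN_ALBUMIN = 34.46  # g/l, assumed (rounded) sample mean
SD_ALBUMIN = 5.84     # g/l, assumed (rounded) sample standard deviation
THRESHOLD = 42.0      # g/l

# Step 1: express the threshold as a standard Normal deviate.
z = (THRESHOLD - MEAN_ALBUMIN) / SD_ALBUMIN

# Step 2: upper tail area, mimicking a table look-up at z rounded to 2 places.
p_table = 1 - NormalDist().cdf(round(z, 2))

# The same probability without the intermediate rounding of z.
p_direct = 1 - NormalDist(MEAN_ALBUMIN, SD_ALBUMIN).cdf(THRESHOLD)

print(f"z = {z:.2f}")
print(f"P(albumin > 42.0), table-style: {p_table:.4f}")
print(f"P(albumin > 42.0), direct:      {p_direct:.4f}")
```

The two answers differ only slightly, because a printed table forces the deviate to be rounded to two decimal places.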

从表B3中可以找到包含给定百分比分布的区间,即中心范围。例如,90% 的分布位于均值 $\pm 1.645$SD 之间,95% 位于均值 $\pm 1.96$SD 之间,99% 位于均值 $\pm 2.576$SD 之间。对于血清白蛋白数据,得到如下范围:
From Table B3 we can find the values which enclose a given percentage of the distribution, known as the central range. For example, 90% of the distribution lies within the range mean $\pm 1.645$SD, 95% within mean $\pm 1.96$SD, and 99% within mean $\pm 2.576$SD. For the serum albumin data we get the following ranges:

| 中心范围 Central range | 血清白蛋白 Serum albumin (g/l) |
| --- | --- |
| 90% | 24.85 to 44.07 |
| 95% | 23.01 to 45.91 |
| 99% | 19.39 to 49.53 |
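These central ranges can be approximated with the inverse of the cumulative Normal distribution. The sketch below is illustrative only: `central_range` is a hypothetical helper, and the mean (34.46 g/l) and standard deviation (5.84 g/l) are assumed rounded values, so the results may differ from the table in the second decimal place.

```python
from statistics import NormalDist

# Assumed rounded mean and SD of serum albumin (g/l).
albumin = NormalDist(mu=34.46, sigma=5.84)

def central_range(dist, coverage):
    """Return (lower, upper) limits enclosing `coverage` of `dist`."""
    tail = (1 - coverage) / 2           # equal area left in each tail
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

for coverage in (0.90, 0.95, 0.99):
    lower, upper = central_range(albumin, coverage)
    print(f"{coverage:.0%} central range: {lower:.2f} to {upper:.2f} g/l")
```

Using `inv_cdf` avoids having to look up the multipliers 1.645, 1.96 and 2.576 separately.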

因此,我们可以使用正态分布来估计总体中该变量分布的百分位数。我们本可以计算样本数据的观察百分位数并用作总体百分位数的估计,但当数据接近正态分布时,使用正态分布更为可靠,尤其是在分布的尾部。此外,方法也更简便,只需两个数值和正态分布表,而不必使用完整的原始数据集。图4.5显示,216个血清白蛋白值的分布与具有相同均值和标准差的正态分布非常相似。我们可以用刚才描述的方法,从正态分布计算出直方图各区间内预期的数值。例如,区间26.0到 28.0 g/l 内预期的数值是该区间概率乘以216。26.0和28.0对应的标准正态偏差为
We can thus use the Normal distribution to estimate the centiles of the distribution of the variable in the population. We could have calculated the observed centiles of the sample data and used these values as estimates of the population centiles, but when the data are near to Normal the use of the Normal distribution is more reliable, especially in the tails of the distribution. It is also easier, requiring just two values and a table of the Normal distribution rather than the complete set of raw data values. Figure 4.5 showed that the distribution of the 216 serum albumin values was very similar to the Normal distribution with the same mean and standard deviation. We can use the procedure just described to calculate from the Normal distribution the number of values expected in each interval of the histogram. For example, the number expected in the interval 26.0 to 28.0 g/l is the probability of being in that interval multiplied by 216. The standard Normal deviates for 26.0 and 28.0 are

$$\frac{26.0 - 34.46}{5.84} = -1.45 \quad\text{and}\quad \frac{28.0 - 34.46}{5.84} = -1.11.$$

从表B1查得下尾概率分别为0.0735和0.1335,两者之差为 $0.1335 - 0.0735 = 0.0600$,即26.0到28.0之间的概率为0.0600。因此该区间内预期观测数为 $216 \times 0.0600 = 12.96$,约为13。表4.1展示了类似(但更精确)计算的结果,涵盖所有数值区间,列出了观察频数和若总体血清白蛋白分布为均值 34.46 g/l 和标准差 5.84 g/l 的正态分布时的预期频数。注意,预期频数通常以小数形式表示,尽管观察频数必须是整数。
From Table B1 we get lower tail areas of 0.0735 and 0.1335, giving a probability of $0.1335 - 0.0735 = 0.0600$ of being between 26.0 and 28.0. The expected number of observations in this interval is thus $216 \times 0.0600 = 12.96$, or about 13. Table 4.1 shows the results of similar (but more precise) calculations for the whole range of values, giving observed frequencies and the frequencies expected if the population distribution of serum albumin was a Normal distribution with the same mean (34.46 g/l) and standard deviation (5.84 g/l). Note that expected numbers are usually quoted as fractions even though the observed frequencies must be whole numbers.
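The expected frequencies can be sketched as follows. This is an illustration under stated assumptions rather than the calculation actually used for Table 4.1: the mean and standard deviation are the rounded values 34.46 and 5.84 g/l, and `expected_count` is a name of my own.

```python
from statistics import NormalDist

N_PATIENTS = 216
albumin = NormalDist(mu=34.46, sigma=5.84)  # assumed rounded mean and SD (g/l)

def expected_count(lower, upper):
    """Expected number of the 216 values falling between lower and upper."""
    return N_PATIENTS * (albumin.cdf(upper) - albumin.cdf(lower))

# Intervals of width 2 g/l, as in the histogram of Figure 4.1.
for lower in range(20, 56, 2):
    upper = lower + 2
    print(f"{lower}-{upper} g/l: {expected_count(lower, upper):5.1f}")
```

Working without the intermediate rounding of the Normal deviates gives about 13.1 for the interval 26.0 to 28.0, rather than the 12.96 of the hand calculation.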

表4.1 216例原发性胆汁性肝硬化患者血清白蛋白的分布及基于相同均值和标准差的正态分布预期频数
Table 4.1 Distribution of serum albumin in 216 patients with primary biliary cirrhosis together with expected frequencies based on a Normal distribution with the same mean and standard deviation

| 血清白蛋白 Serum albumin (g/l) | 观察频数 Observed frequency | 预期频数 Expected frequency |
| --- | --- | --- |
| < 20 | 0 | 1.4 |
| 20- | 2 | 2.1 |
| 22- | 6 | 4.4 |
| 24- | 7 | 8.0 |
| 26- | 9 | 13.1 |
| 28- | 21 | 19.1 |
| 30- | 20 | 24.7 |
| 32- | 28 | 28.5 |
| 34- | 39 | 29.2 |
| 36- | 28 | 26.7 |
| 38- | 22 | 21.8 |
| 40- | 12 | 15.8 |
| 42- | 11 | 10.2 |
| 44- | 4 | 5.9 |
| 46- | 3 | 3.0 |
| 48- | 1 | 1.4 |
| 50- | 1 | 0.6 |
| 52- | 1 | 0.2 |
| 54- | 0 | 0.1 |
| 56- | 1 | 0.0 |
| 总计 Total | 216 | 216.2 |

我在第4.4节中指出,广泛使用的参数统计分析方法包含了关于数据分布的重要假设。在大多数情况下,所涉及的分布是正态分布,这也是它成为统计学中最重要分布的原因之一。虽然许多测量值确实近似服从正态分布,比如人体身高,但许多则不然,如人体体重或血清胆固醇。数据偏离正态性的方式多种多样,尤以非对称或偏斜为主。图3.3中展示的IgM数据即表现出正偏斜。不能假设一组观察值近似正态分布,必须加以验证。与正态分布密切相关的一种常见偏斜分布是对数正态分布,下一节将对此进行讨论。
I observed in section 4.4 that the widely-used parametric methods of statistical analysis incorporate important assumptions about the distribution of data. In most cases the distribution involved is the Normal distribution, which is one of the reasons why it is the most important distribution in statistics. Although many measurements do have a reasonably Normal distribution, such as human height, many do not, such as human weight or serum cholesterol. There are various ways in which data may deviate from Normality, notably by being asymmetric or skewed. The IgM data shown in Figure 3.3 illustrated positive skewness. It should not be assumed that a set of observations is approximately Normal; this must be established. One common type of skewed distribution closely related to the Normal distribution is the Lognormal distribution, which is discussed in the next section.

4.5.3 抽样变异 4.5.3 Sampling variation

图4.5展示了血清白蛋白观察值与具有相同均值和标准差的正态分布的直观比较。是否可以认为数据足够接近正态分布的问题很重要,并将在后续章节的多个部分中加以讨论。
Figure 4.5 showed a visual comparison of a set of observations of serum albumin and the Normal distribution having the same mean and standard deviation. The question of whether data are close enough to a Normal distribution is important, and will be considered at various points in the following chapters.

虽然可以使用正式的方法(第7章中有描述),但判断一组观察值是否近似正态分布通常依赖于主观判断,通常通过观察直方图来完成。观察从正态分布中随机抽取样本所得的分布作为参考,有助于判断一组观测数据的分布情况。图4.7展示了16个样本的频率直方图,每个样本包含50个从标准正态分布中随机抽取的观察值。每个样本相当于考虑从一个已知变量服从正态分布的总体中抽取的50个个体。这些样本的分布表现出较大的不规则性,关键特征如单峰性和对称性通常缺失。在判断观察数据是否可能来自正态分布时,尤其是样本量较小时,应牢记这一点。
Although formal methods can be used (described in Chapter 7), whether a set of observations is reasonably Normal is often a matter of judgement, usually by visual inspection of a histogram. It is instructive to look at distributions obtained by taking random samples from a Normal distribution to give a reference against which to judge a set of observed data. Figure 4.7 shows frequency histograms of 16 samples of 50 observations sampled at random from the standard Normal distribution. Each sample is equivalent to considering 50 individuals sampled from a population known to have a Normal distribution for the variable of interest. There is considerable irregularity in the distributions of these samples, with the key properties of unimodality and symmetry generally absent. This figure should be borne in mind when considering whether observed data might have come from a Normal distribution, especially when the sample size is small.
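Readers can repeat this experiment for themselves. The following sketch draws 16 samples of 50 values from the standard Normal distribution and prints a crude text histogram of each; the seed, the bin width of 0.5 and the helper `bin_counts` are arbitrary choices of mine, not those used to draw Figure 4.7.

```python
import random
from statistics import mean, stdev

random.seed(1)  # arbitrary seed, only so that a run can be repeated

N_SAMPLES, SAMPLE_SIZE = 16, 50
samples = [[random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
           for _ in range(N_SAMPLES)]

def bin_counts(values, lo=-3.0, hi=3.0, width=0.5):
    """Count values in equal-width bins on [lo, hi); others are ignored."""
    n_bins = int((hi - lo) / width)
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            index = min(int((v - lo) // width), n_bins - 1)
            counts[index] += 1
    return counts

for i, sample in enumerate(samples, start=1):
    print(f"sample {i:2d}: mean {mean(sample):+.2f}, SD {stdev(sample):.2f}")
    for j, count in enumerate(bin_counts(sample)):
        print(f"  {-3.0 + 0.5 * j:+.1f} | {'*' * count}")
```

Changing `SAMPLE_SIZE` to a few hundred shows how much smoother the histograms become as the sample size grows.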


图4.7 正态分布中16个样本(每个样本50个观测值)的分布。
Figure 4.7 Distributions of 16 samples of size 50 from the Normal distribution.

4.6 对数正态分布 4.6 THE LOGNORMAL DISTRIBUTION

在第3.4节中我们看到,在某些情况下,一组呈正偏态分布的数据通过取对数可以转化为对称分布。对偏态分布的数据取对数,通常会得到接近正态的分布。图4.8显示了同一216名原发性胆汁性肝硬化(PBC)患者的血清胆红素水平的直方图。均值和标准差分别为60.7和 […]。叠加的最佳拟合正态分布(均值和标准差相同)与数据的拟合非常差,原因是数据极度偏斜。如果对数据取自然对数(以e为底),则得到一个更对称的分布,均值为3.55,标准差为1.03。图4.9显示了 $\log_e$ 血清胆红素的直方图及拟合的正态分布,拟合效果明显更好。图4.10显示了原始数据及拟合正态分布函数的“反变换”。拟合曲线是对数正态分布函数的一个例子。对数正态分布的数据可通过取对数转换为正态分布。
In section 3.4 we saw that in some circumstances a set of data with a positively skewed distribution can be transformed into a symmetric distribution by taking logarithms. Taking logs of data with a skewed distribution will often give a distribution that is near to Normal. Figure 4.8 shows a histogram of serum bilirubin levels in the same 216 patients with primary biliary cirrhosis (PBC). The mean and standard deviation are 60.7 and […] respectively. The superimposed best-fitting Normal distribution (with the same mean and standard deviation) is a terrible fit to the data because of the extreme skewness. If we take logs (to base e) of the data we get a much more symmetric distribution with a mean of 3.55 and a standard deviation of 1.03. Figure 4.9 shows a histogram of $\log_e$ serum bilirubin with the fitted Normal distribution, which is a much better fit. Figure 4.10 shows the raw data with the 'back-transformation' of the fitted Normal distribution function. The fitted curve is an example of the Lognormal distribution function. Data with a Lognormal distribution can be transformed to Normality by taking logarithms.

图4.8 原发性胆汁性肝硬化(PBC)患者216例血清胆红素值的直方图及拟合的正态分布(摘自Christensen等,1985年的研究)。
Figure 4.8 Histogram of serum bilirubin values in 216 patients with primary biliary cirrhosis with fitted Normal distribution (from the study by Christensen et al., 1985).

对于像血清胆红素测量这样偏斜的数据,取对数转换通常能产生近似正态分布。我们可以在对数数据上进行计算,然后将结果转换回原始尺度。例如,我们可能希望用数据估计包含所有PBC患者95%血清胆红素水平的区间。假设数据服从对数正态分布,我们可以利用均值为3.547、标准差为1.030的正态分布进行计算(这些值比上文所示更准确)。在对数单位上,95%的分布预计位于均值减去 1.96SD 和均值加上 1.96SD 之间。具体数值为:
With skewed data like the serum bilirubin measurements log transformation will often produce approximate Normality. We can then perform our calculations on the log data and transform the answers back to the original scale. For example, we may wish to use our data to estimate the values enclosing 95% of serum bilirubin levels for all patients with PBC. Assuming a Lognormal distribution, we can make our calculations from the Normal distribution with mean 3.547 and standard deviation 1.030 (these being more accurate values than those shown above). In log units, 95% of the distribution will be expected to be between mean $- 1.96$SD and mean $+ 1.96$SD. These values are

$$3.547 - (1.96 \times 1.030) = 1.528 \quad\text{and}\quad 3.547 + (1.96 \times 1.030) = 5.566.$$


图4.9 血清胆红素对数值的直方图及拟合的正态分布(以e为底的对数)。
Figure 4.9 Histogram of log serum bilirubin with fitted Normal distribution (logarithms to base e).


图4.10 血清胆红素的直方图及拟合的对数正态分布。
Figure 4.10 Histogram of serum bilirubin with fitted Lognormal distribution.

这些数值的反对数(使用函数 $e^x$)分别约为 4.6 和 261。对数数据均值的反对数为 $e^{3.547} \approx 34.7$,即数据的几何均值。所有这些数值都在图4.11的箱线图中展示。
The antilogs of these values (using the function $e^x$) are about 4.6 and 261. The antilog of the mean of the log data is $e^{3.547} \approx 34.7$, which is the geometric mean of the data. All of these values are depicted in a box-and-whisker diagram in Figure 4.11.
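The back-transformation is easy to reproduce. The sketch below uses the log-scale mean (3.547) and standard deviation (1.030) quoted above; the variable names are my own.

```python
import math

LOG_MEAN = 3.547  # mean of log_e serum bilirubin, as quoted in the text
LOG_SD = 1.030    # SD of log_e serum bilirubin, as quoted in the text

# 95% central range on the log scale.
lower_log = LOG_MEAN - 1.96 * LOG_SD
upper_log = LOG_MEAN + 1.96 * LOG_SD

# Back-transform to the original scale with the antilog, e**x.
lower = math.exp(lower_log)
upper = math.exp(upper_log)
geometric_mean = math.exp(LOG_MEAN)

print(f"log scale: {lower_log:.3f} to {upper_log:.3f}")
print(f"original scale: {lower:.1f} to {upper:.0f}")
print(f"geometric mean: {geometric_mean:.1f}")
```

Note that the back-transformed range is far from symmetric about the geometric mean, which reflects the skewness of the original data.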

不应假设偏斜分布的数据都能通过转换近似为正态分布。必须通过视觉检查(如图4.9)或第7.5节描述的方法正式验证。
It should not be assumed that data with a skewed distribution can be transformed to approximate Normality. This must be established, perhaps visually as in Figure 4.9 or formally using the methods described in section 7.5.


图4.11 血清胆红素的箱线图,显示基于对数数据拟合正态分布得出的95%中心区间。
Figure 4.11 Box-and-whisker diagram of serum bilirubin showing the 95% central range derived from fitting a Normal distribution to log data.

4.7 二项分布 4.7 THE BINOMIAL DISTRIBUTION

离散数据中最简单的概率分布是只有两种可能性的情况。血型为B的概率约为0.08,因此血型为O、A或AB的概率为0.92。对于一组无关的人,我们可以计算不同人数属于血型B的概率。两个人都属于血型B的概率是 $0.08 \times 0.08 = 0.0064$,两个人都不属于血型B的概率是 $0.92 \times 0.92 = 0.8464$。我们将概率相乘是因为两个人的血型是独立的。仅有一人属于血型B的概率更复杂,因为有两种情况可能发生。因此,恰有一人属于血型B的概率为 $(0.08 \times 0.92) + (0.92 \times 0.08) = 0.1472$。
The simplest probability distribution for discrete data is when there are only two possibilities. The probability of being in blood group B is about 0.08 so the probability of being group O, A or AB is 0.92. For a group of unrelated people, we can work out the probability of different numbers of people being in blood group B. The probability of two people both being in blood group B is thus $0.08 \times 0.08 = 0.0064$, and the probability of neither being in blood group B is $0.92 \times 0.92 = 0.8464$. We multiply the probabilities because the blood groups of two unrelated people are independent. The probability of only one of the two being in blood group B is more complicated, because there are two ways in which this could happen. Thus the probability of exactly one of two people being in blood group B is $(0.08 \times 0.92) + (0.92 \times 0.08) = 0.1472$.

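我们可以将这些可能性总结如下:
We can summarize the possibilities as follows:

| B型血人数 Number in blood group B | 概率 Probability |
| --- | --- |
| 0 | 0.8464 |
| 1 | 0.1472 |
| 2 | 0.0064 |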



图4.12显示了两人中属于血型B人数的概率分布。该分布是二项分布的一个简单例子。为了得到图中显示的三个概率,我们做了三次简单计算。然而,如果将计算扩展到四个人的情况,计算就不那么简单了。每个人要么是血型B,要么不是,因此可能的排列组合是 $2 \times 2 \times 2 \times 2 = 2^4$,即16种。对于 $n$ 个人,可能的排列组合数是 $2^n$,例如七个人则有128种可能。
Figure 4.12 shows the probability distribution of the number of people out of two in blood group B. This distribution is a simple example of the Binomial distribution. To get the three probabilities shown we had to make three simple calculations. However, if we extend this simple calculation to consider the number of people out of four it is not so easy. Each person is either group B or not group B so there are $2 \times 2 \times 2 \times 2 = 2^4$ possible orderings, which is 16. The number of possible orderings for $n$ people is $2^n$, so if we have seven people, for example, there are 128 possible orderings.

幸运的是,我们可以使用一个通用公式来跳过大部分计算。由于公式较复杂,详细内容见第4.9节。利用该公式,可以根据单个事件发生的概率,计算一系列事件中某种结果出现不同次数的概率。例如,图4.13显示了10人中属于血型B人数的概率分布。(计算过程见第4.9节。)该分布是不对称的,但随着样本量增加,二项分布会变得越来越对称,并逐渐趋近于正态分布。图4.14显示,100人样本中血型B人数的二项分布几乎是对称的。
Fortunately, we can bypass most of the calculations by using a general formula. As it is rather complicated, the details are given in section 4.9. Using the formula one can calculate the probability of different numbers of outcomes of a particular type in a series of events from the probability of one such outcome. For example, Figure 4.13 shows the probability distribution for the number of individuals out of 10 being of blood group B. (The calculations are shown in section 4.9.) The distribution is asymmetric, but as the sample size increases the Binomial distribution becomes more symmetric and gradually begins to look like a Normal distribution. Figure 4.14 shows that the Binomial distribution for the number of people in blood group B in a sample of 100 is almost symmetric.

图4.12 两人中属于血型B人数的二项分布。
Figure 4.12 Binomial distribution of number of people out of two in blood group B.

图4.13 基于血型B概率为0.08,显示10人中属于血型B人数的二项分布。
Figure 4.13 Binomial distribution showing the number of subjects out of ten in blood group B based on the probability of being in blood group B of 0.08.

图4.14 基于血型B概率为0.08,显示100人中属于血型B人数的二项分布。
Figure 4.14 Binomial distribution showing the number of subjects out of 100 in blood group B based on the probability of being in blood group B of 0.08.
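For readers who wish to reproduce these probabilities now, here is a minimal sketch using the standard Binomial formula, $\binom{n}{k} p^k (1-p)^{n-k}$, which I am assuming is the general formula deferred to section 4.9. The function name `binomial_pmf` is my own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k outcomes of one type in n independent events."""
    # comb(n, k) counts the orderings that contain exactly k such outcomes.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

P_GROUP_B = 0.08  # probability of blood group B, as quoted in the text

for n in (2, 10):
    print(f"n = {n}")
    for k in range(n + 1):
        print(f"  {k} in group B: {binomial_pmf(k, n, P_GROUP_B):.4f}")
```

With `n = 2` the function returns the three probabilities worked out by hand above; with `n = 100` it gives the nearly symmetric shape described for Figure 4.14.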

二项分布有时用于比较观察到的数据集与预期分布。其主要用途是在只有两种可能性的情况下分析数据,例如某人是否患有哮喘。这里我们关注的是患哮喘受试者的比例。这类数据在医学研究中经常出现,我们常常希望比较不同受试者组中某类事件发生的比例。组内样本量通常足够大,二项分布近似于具有相同均值和标准差的正态分布,从而简化分析(见第10章)。
The Binomial distribution is sometimes used to compare an observed set of data with the expected distribution. Its main use, however, is in the analysis of data where there are only two possibilities, such as whether or not someone suffers from asthma. Here we are interested in the proportion of subjects with asthma. Data of this type occur frequently in medical research, and we often wish to compare the proportion of events of a certain type occurring in different groups of subjects. The sample sizes in the groups are often large enough for the Binomial distribution to be very like a Normal distribution with the same mean and standard deviation, which simplifies analysis (see Chapter 10).

4.8 泊松分布 4.8 THE POISSON DISTRIBUTION

另一种离散数据类型是计数事件发生的次数,可能针对不同受试者或时间单位。此类数据的例子包括癌症登记处每日报告的新乳腺癌病例数,以及一系列肝活检组织切片中固定面积的异常细胞数。
A different type of discrete data arises when we count the number of occurrences of an event, perhaps for different subjects or for units of time. Examples of data like this are the daily number of new cases of breast cancer notified to a cancer registry, and the number of abnormal cells in a fixed area of histological slides from a series of liver biopsies.

这类数据的理论背景最易用事件随时间(或空间)以固定平均速率独立随机发生来描述。此类数据服从泊松分布。例如,癌症每日新登记病例平均为2.2,但某天可能无新病例,也可能有多例。若假设泊松分布条件成立,我们可以计算某天出现任意新病例数的概率。图4.15展示了这些概率(计算过程见4.9节)。
The theoretical situation giving rise to data of this type is easiest to describe in relation to events occurring over time (or space) at a fixed rate on average, but where each event occurs independently and at random. Such data will have a Poisson distribution. For example, the daily number of new registrations of cancer may be 2.2 on average, but on any day there may be no new cases or there may be several. If we assume that the conditions for a Poisson distribution hold, we can calculate the probability of any number of new cases on a single day. These probabilities are shown in Figure 4.15 (and the calculations are shown in section 4.9).

当均值较小时,泊松分布高度偏斜,如本例,但均值较大(如50)时,分布趋于对称。实际上,类似二项分布,泊松分布也趋近于正态分布。注意泊松分布无理论最大值,但概率迅速趋近于零。
The Poisson distribution is very asymmetric when its mean is small, as here, but with a large mean, such as 50, it becomes nearly symmetric. In fact, like the Binomial distribution, it becomes more like a Normal distribution. Note that the Poisson distribution has no theoretical maximum value, but the probabilities tail off towards zero very quickly.
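As an illustration, the probabilities for a mean of 2.2 new cases per day can be sketched with the standard Poisson formula, $e^{-\mu}\mu^k/k!$, which I am assuming is the one given in section 4.9. The function name `poisson_pmf` is my own.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k events when the average count is `mean`."""
    return exp(-mean) * mean**k / factorial(k)

MEAN_PER_DAY = 2.2  # average daily number of new registrations, from the text

for k in range(9):
    print(f"{k} new cases: {poisson_pmf(k, MEAN_PER_DAY):.4f}")

# There is no theoretical maximum, but the remaining probability is tiny.
p_more_than_8 = 1 - sum(poisson_pmf(k, MEAN_PER_DAY) for k in range(9))
print(f"more than 8 new cases: {p_more_than_8:.4f}")
```

The final figure shows how quickly the probabilities tail off: more than eight new cases in a day would be expected less than once in a thousand days.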

表4.2展示了可能服从泊松分布的数据。该表记录了1978至1982年印度三个小区域在满月日和新月日的每日犯罪次数。表中还列出了基于与观察数据相同均值的泊松分布计算出的不同犯罪次数的预期天数。观察频数与预期频数尤其在新月日表现出高度一致,说明数据接近泊松分布。
Table 4.2 shows some data that might be expected to follow a Poisson distribution. The table gives the number of crimes per day in three small areas of India from 1978 to 1982, on days where there was either a full moon or a new moon. Also shown are the expected numbers of days with different numbers of crimes, based on Poisson distributions with the same means as the observed data. The similarity between the observed and expected frequencies is clear, especially for the new moon days, demonstrating that these data are close to a Poisson distribution.

The Poisson distribution is completely described by a single parameter, the mean, as is shown in section 4.9, because the variance of the Poisson distribution turns out to be the same as the mean. It follows that data from different sources will have very similar distributions if they can both be considered to be close to Poisson and have the same mean. Table 4.3 shows that the relative frequency distribution of the number of crimes on new moon days in India is virtually identical to the distribution of the number of deaths per day in a hospital in Montreal. Both observed sets of data are very close to a Poisson distribution with a mean of 0.51.

Figure 4.15 Poisson distribution with mean 2.2.

Table 4.2 Number of crimes per day in three areas of India during 1978 to 1982 (Thakur and Sharma, 1984) showing observed frequencies (Obs) and expected frequencies using the Poisson distribution (Exp)

| Number of crimes | Full moon days: Obs | Full moon days: Exp | New moon days: Obs | New moon days: Exp |
|---|---|---|---|---|
| 0 | 40 | 45.2 | 114 | 112.8 |
| 1 | 64 | 63.1 | 56 | 56.4 |
| 2 | 56 | 44.3 | 11 | 14.1 |
| 3 | 19 | 20.7 | 4 | 2.4 |
| 4 | 1 | 7.1 | 1 | 0.3 |
| 5 | 2 | 2.0 | 0 | 0.0 |
| 6 | 0 | 0.5 | 0 | 0.0 |
| 7 | 0 | 0.1 | 0 | 0.0 |
| 8 | 0 | 0.0 | 0 | 0.0 |
| 9 | 1 | 0.0 | 0 | 0.0 |
| Total | 183 | 183.0 | 186 | 186.0 |
| Mean | 1.40 | | 0.50 | |
| SD | 1.16 | | 0.75 | |

Table 4.3 Comparison of distributions of crimes on new moon days (Thakur and Sharma, 1984) and number of deaths per day in a Montreal hospital in 1971 (Zweig and Csank, 1978)

| n | Crimes on new moon days in India: % | Crimes on new moon days in India: Frequency | Deaths per day in a hospital in Montreal: % | Deaths per day in a hospital in Montreal: Frequency | Expected distribution Poisson (0.51): % |
|---|---|---|---|---|---|
| 0 | 61.3 | 114 | 60.3 | 220 | 60.0 |
| 1 | 30.1 | 56 | 31.0 | 113 | 30.6 |
| 2 | 5.9 | 11 | 6.3 | 23 | 7.8 |
| 3 | 2.2 | 4 | 2.2 | 8 | 1.3 |
| 4+ | 0.5 | 1 | 0.3 | 1 | 0.2 |
| Total | 100.0 | 186 | 100.0 | 365 | 99.9 |
| Mean | 0.505 | | 0.512 | | |
| SD | 0.752 | | 0.736 | | |

The Poisson distribution is appropriate for studying rare events. We can consider the problem as being the same as that of the Binomial distribution where the probability of the outcome of interest is very small but there are a large number of events. The Poisson distribution is not used greatly in medical research although, like the Binomial distribution, it is used implicitly in some other types of statistical analysis.
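The link between the two distributions can be checked numerically. The sketch below is illustrative only: the number of events `n` and the probability `p` are hypothetical values chosen so that the mean `n * p` equals 2.2, and the function names are not from the text.

```python
import math

def binomial_pmf(r, n, p):
    """Binomial probability of r outcomes of interest among n events."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(r, mu):
    """Poisson probability of r events when the mean is mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

# Hypothetical rare-event setting: many events, each with a very small probability
n, p = 10_000, 0.00022
mu = n * p  # mean number of outcomes of interest, about 2.2

# Largest absolute difference between the two sets of probabilities for r = 0..10
max_gap = max(abs(binomial_pmf(r, n, p) - poisson_pmf(r, mu)) for r in range(11))

for r in range(6):
    print(r, round(binomial_pmf(r, n, p), 4), round(poisson_pmf(r, mu), 4))
print("largest difference:", max_gap)
```

With a probability this small and a number of events this large, the two columns agree closely, which is why the Poisson distribution is a convenient description of rare events.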

4.9 MATHEMATICAL CALCULATIONS

(This section gives the mathematical calculations relating to the sections on the Binomial and Poisson distributions. It can be omitted without loss of continuity.)

4.9.1 Binomial distribution

To take an example, suppose we wish to calculate the probability of different numbers of individuals out of ten being blood group B, for which $p = 0.08$. The probability of, say, a particular 4 of the 10 people being blood group B is $0.08^{4} \times 0.92^{6}$, so that the probability of any 4 being blood group B is this probability multiplied by the number of ways of choosing 4 people from 10.

In general, suppose we have $n$ 'events' and wish to calculate the probability of 0, 1, 2, up to $n$ of them being a specific type, where $p$ is the overall probability of this type of outcome. Then the Binomial probability of $r$ such events is given by

$${}^{n}C_{r}\, p^{r} (1-p)^{n-r}$$

where ${}^{n}C_{r}$ is the number of ways of choosing $r$ items from $n$, and is a number we have to calculate.

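We can evaluate ${}^{n}C_{r}$ simply by using the following relations:

(i) ${}^{n}C_{0} = 1$,

(ii) ${}^{n}C_{r+1} = {}^{n}C_{r} \times \dfrac{n-r}{r+1}$

and

(iii) ${}^{n}C_{n-r} = {}^{n}C_{r}$.

So we have

$${}^{10}C_{0} = 1$$
$${}^{10}C_{1} = 1 \times 10/1 = 10$$
$${}^{10}C_{2} = 10 \times 9/2 = 45$$
$${}^{10}C_{3} = 45 \times 8/3 = 120$$
$${}^{10}C_{4} = 120 \times 7/4 = 210.$$

The probability that 4 of the 10 people are blood group B is thus

$$210 \times 0.08^{4} \times 0.92^{6} = 0.0052$$

or about 0.5%. Figure 4.13 shows the complete distribution.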

The general formula for the coefficients ${}^{n}C_{r}$ is

$${}^{n}C_{r} = \frac{n!}{r!\,(n-r)!}$$

where $n!$ (pronounced n factorial) is equal to $n \times (n-1) \times \dots \times 2 \times 1$ (see Appendix A). Note that $0! = 1$ (see Appendix A). The coefficients ${}^{n}C_{r}$ can be obtained from tables of ${}^{n}C_{r}$ (Lentner, 1982, pp. 74-81), or calculated in the way described above.

If the true proportion of events of interest is $p$, then in a sample of size $n$ the mean of the Binomial distribution is $np$ and the standard deviation is $\sqrt{np(1-p)}$.
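The blood group calculation in this section can be sketched in a few lines. This is an illustrative sketch rather than the book's own procedure: it builds the coefficients with a standard step-by-step recurrence for binomial coefficients, and the function name and variable names are assumptions.

```python
import math

def binomial_coefficients(n):
    """Return the n + 1 binomial coefficients for r = 0, 1, ..., n.

    Each coefficient is obtained from the previous one by multiplying by
    (n - r) and dividing by (r + 1); the division is always exact.
    """
    coeffs = [1]
    for r in range(n):
        coeffs.append(coeffs[r] * (n - r) // (r + 1))
    return coeffs

n, p = 10, 0.08  # ten people, probability 0.08 of being blood group B

coeffs = binomial_coefficients(n)
probs = [c * p**r * (1 - p)**(n - r) for r, c in enumerate(coeffs)]

mean = n * p                      # mean of the Binomial distribution
sd = math.sqrt(n * p * (1 - p))   # standard deviation of the Binomial distribution

print("coefficients:", coeffs)
print("P(4 of 10 are group B) =", round(probs[4], 4))
print("mean =", mean, "SD =", round(sd, 3))
```

Carrying the coefficients forward as whole numbers avoids any rounding until the final multiplication by the powers of $p$ and $1 - p$.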

4.9.2 Poisson distribution

The general Poisson formula for the probability of $r$ events is $e^{-\mu}\mu^{r}/r!$, where $\mu$ (the Greek letter mu) is the mean and $e$ is a mathematical constant approximately equal to 2.718 (see Appendix A). The standard deviation is $\sqrt{\mu}$.

If the conditions for a Poisson distribution hold, the probability of getting no new cases on a day is

$$P(0) = \frac{e^{-\mu}\mu^{0}}{0!} = e^{-\mu}.$$

The Poisson distribution that will fit the data best has the same mean as that of the observations: 2.2. So here $P(0)$ is $e^{-2.2} = 0.1108$. Rather than use the complicated formula above we can calculate $P(1)$, $P(2)$, etc. from the relation $P(r+1) = P(r) \times \bar{x}/(r+1)$, where $\bar{x}$ is the sample mean. So we have

$$P(1) = 0.1108 \times 2.2/1 = 0.2438$$
$$P(2) = 0.2438 \times 2.2/2 = 0.2681$$
$$P(3) = 0.2681 \times 2.2/3 = 0.1966$$

and so on. The distribution is shown in Figure 4.15.

Note that this is a good example of the need to keep full numerical precision through a series of calculations, because any error caused by rounding would affect all subsequent calculations. The figures shown above have, however, been rounded to clarify the presentation.
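The step-by-step Poisson calculation can be sketched as follows, keeping full precision internally and rounding only for display. The function name is an assumption, and the mean of 2.2 is the daily number of cancer registrations used in the example.

```python
import math

def poisson_probs(mean, max_r):
    """Return Poisson probabilities for r = 0, 1, ..., max_r.

    Starts from P(0) = exp(-mean) and obtains each later probability from
    the previous one by multiplying by mean / (r + 1).
    """
    probs = [math.exp(-mean)]
    for r in range(max_r):
        probs.append(probs[r] * mean / (r + 1))
    return probs

probs = poisson_probs(2.2, 8)

# Round only when printing, so that rounding errors are not carried forward
for r, prob in enumerate(probs):
    print(r, round(prob, 4))
```

Each probability is stored unrounded, so an early rounding error cannot propagate through the later steps.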

4.10 THE UNIFORM DISTRIBUTION

A different problem is that of determining whether there is a seasonal variation in the onset of a disease. If there is no seasonal variation we would expect little variation in the number of new cases each month. For example, if a diabetes clinic in a district general hospital registers 126 new cases in a year, and if there were no seasonality for the onset of diabetes, then we would expect to have one-twelfth of 126 or 10.5 new cases in every month. (We could make a slight correction for the variation in the number of days in a month.) In practice natural variability will lead to some variation in the monthly accrual of new cases, but this will be unsystematic if there is no seasonality, whereas there will be some systematic trend if there is seasonality. The theoretical Uniform distribution, which has the same relative frequency for each month, is used for examining such data. Statistical analysis of periodic variation is discussed in section 14.7.
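As a small illustration of the expected monthly counts, the sketch below computes both the simple one-twelfth figure and a version corrected for the number of days in each month. The function name is an assumption, and the year 1991 is an arbitrary non-leap year chosen only for the example.

```python
import calendar

def expected_monthly_cases(total_cases, year=None):
    """Expected new cases per month if there is no seasonal variation.

    With no year given, every month receives one-twelfth of the total.
    With a year, the total is shared in proportion to the days in each month.
    """
    if year is None:
        return [total_cases / 12] * 12
    days = [calendar.monthrange(year, month)[1] for month in range(1, 13)]
    return [total_cases * d / sum(days) for d in days]

equal_share = expected_monthly_cases(126)
by_days = expected_monthly_cases(126, year=1991)  # hypothetical non-leap year

print("equal share per month:", equal_share[0])
print("corrected for month length:", [round(x, 2) for x in by_days])
```

The correction is slight: long months come out a little above 10.5 and February a little below, while the yearly total is unchanged.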

4.11 CONCLUDING REMARKS

Theoretical distributions feature in some way in a large proportion of statistical analysis. The Normal distribution is by far the most important of those discussed. Apart from the assumptions of many analyses that the data follow a Normal distribution, there is also a central role for the Normal distribution in many methods of statistical inference, as described in Chapter 8.

There are many other probability distributions not discussed in this chapter. Most of these are of specialized use and will not appear in this book, but three are important in statistical analyses described in later chapters: the $t$ distribution, the $F$ distribution and the Chi squared distribution.

EXERCISES

4.1 Assuming that the height of adult males has a Normal distribution, what proportion of males will be more than two standard deviations above the mean height?

4.2 The probability of being blood group B is 0.08. What is the probability that if one pint of blood is taken from each of 100 unrelated blood donors fewer than three pints of group B blood will be obtained?

4.3 The probability of a baby being a boy is 0.52. For six women delivering consecutively in the same labour ward on one day, which of the following exact sequences of boys and girls is most likely and which least likely?

GBGBGB BBBGGG GBBBBB

4.4 The Binomial distribution with $n = 10$ and $p = 0.15$ is as follows:

| r | Probability | r | Probability |
|---|---|---|---|
| 0 | 0.1969 | 6 | 0.0012 |
| 1 | 0.3474 | 7 | 0.0001 |
| 2 | 0.2759 | 8 | 0.0000 |
| 3 | 0.1298 | 9 | 0.0000 |
| 4 | 0.0401 | 10 | 0.0000 |
| 5 | 0.0085 | | |

(a) If 15% of all pregnancies result in miscarriages, what is the probability that more than half of a group of ten pregnant women will have a miscarriage?

(b) Among groups of users of video display terminals there are 20000 large enough for ten women to become pregnant in one year. If we call six or more miscarriages out of 10 a 'cluster', how many clusters would we expect in one year, assuming that there is no increased risk of miscarriage associated with using a terminal? (Based on Blackwell and Chang, 1988)

4.5 If an infection is present in a school it would be expected to spread to 10% of the children.

(a) How many children should be tested to have a probability of 0.95 of detecting the infection if it is there? (Hint: consider the probability of all the children in the sample being negative to the test if the infection is present in the school.)

(b) What is the effect of the number of children in the school on this calculation?

4.6 Over a 25 year period the mean height of adult males increased from to , but the standard deviation stayed at . The minimum height requirement for men to join the police force is . What proportion of men would be too short to become policemen at the beginning and end of the 25 year period, assuming that the height of adult males has a Normal distribution?

4.7 A researcher plans to measure blood pressure in a number of subjects. He proposes to take three measurements, but intends to discard the third measurement as unreliable if it does not fall between the first two measurements. Assuming that the subjects' blood pressure stays constant during the measuring, what is the probability that for a given subject the third value will not lie between the other two? (Hint: the answer does not depend upon the variability of blood pressure measurements.) Comment on the researcher's proposal.

4.8 In Britain the commonest autosomal recessive disorder is cystic fibrosis, with about one in 2000 live births being affected. If both parents are heterozygous for the abnormal gene there is a 1 in 4 chance of their child having cystic fibrosis.

(a) What is the probability that a couple who are both heterozygous will have two unaffected children?

(b) If they have four unaffected children, what is the probability that their fifth child would be unaffected?

(c) About one in 22 people is heterozygous for cystic fibrosis. In a hospital where there are 3500 births a year, what is the expected number of babies per year affected by cystic fibrosis (assuming that there is no genetic counselling)?

5 Designing research

Probably no aspect of clinical research is as neglected as study design. Eager young investigators attend classes on medical statistics, find dozens of ways to compute $P$ values, but rarely learn how to organize a clinical research project properly. Yet careful study design is the foundation of quality clinical research.

Noller and Melton (1985)

There are only a handful of ways to do a study properly but a thousand ways to do it wrong.

Sackett (1986)

5.1 INTRODUCTION

All medical research is carried out in relation to one or more objectives, which should focus the plan or design of the research. In some cases there is a clear best way to proceed, but more often there is a choice of reasonable ways of designing a study. The statistical aspects of design relate mainly to the structure of the study and all aspects of the collection of data, including the choice of measurements to make and their frequency. Although many of the general issues covered in this chapter apply to clinical trials, these have many special features and are discussed in detail in Chapter 15.

Research can be crudely divided into observational and experimental studies. In observational studies we collect information about one or more groups of subjects, but do nothing to affect them. Observational studies can be prospective, where subjects are recruited and data are collected about subsequent events, or retrospective, where information is collected about past events. Observational studies include censuses, surveys, case-control studies and cohort studies; they are considered in sections 5.9 to 5.12.

Experimental studies are those in which the researcher affects (controls) what happens to all or some of the individuals. Similar problems arise in studies of humans, animals and laboratory samples, although the emphasis in this chapter is on clinical studies. Sections 5.4 to 5.8 consider the design of experimental studies.

Most studies aim to answer fairly simple questions but it does not necessarily follow that they require fairly simple designs. The key point is to tailor the research design to the study objective(s). Without adequate planning the researcher cannot expect to be able to make meaningful conclusions. Some important general principles of design are discussed later in this chapter.

In most research we wish to extrapolate the results from a study to the population in general. There are two aspects that require particular attention in this respect. First, the sample(s) studied should be representative of the population(s) of interest; this applies especially to observational studies. Secondly, groups being compared should be as alike as possible apart from the features of direct interest; this applies particularly in experimental studies, such as clinical trials, but is also relevant in many observational studies, such as case-control studies. I return to these issues below.

Research design is arguably the most important aspect of the statistical contribution to medicine. It is for this reason that for over 50 years statisticians have been urging medical researchers to consult them at the planning stage of their study, rather than at the analysis stage. The data from a good study can be analysed in many ways, but no amount of clever analysis can compensate for problems with the design of a study.

5.2 CATEGORIES OF RESEARCH DESIGN

Research designs can be classified in several ways, some of which are:

  1. observational or experimental;
  2. prospective or retrospective;
  3. longitudinal or cross-sectional.

These terms are explained below. The first classification relates to the purpose of the study, while the others describe the way in which the data are collected. Not all combinations of these classifications are possible, but most are.

5.2.1 Observational or experimental

In an observational study the researcher collects information on the attributes or measurements of interest, but does not influence events. An example would be a study to discover the prevalence of hearing difficulties in small children. Observational studies include surveys and most epidemiological studies. By contrast, in an experimental study the researcher deliberately influences events and investigates the effects of the intervention. Experimental studies include clinical trials and many animal and laboratory studies. In general stronger inferences can be made from experimental studies than from observational studies. Experimental studies are usually carried out to make comparisons between groups; observational studies may also be comparative, but they are often essentially descriptive.

5.2.2 Prospective or retrospective

There is a clear distinction between prospective studies, in which data are collected forwards in time from the start of the study, and retrospective studies, in which data refer to past events and may be acquired from existing sources, such as hospital notes, or by interview. Experiments are prospective, but observational studies may be prospective or retrospective. Of course, retrospective data can be obtained to compare different treatments, for example different types of mastectomy, but such a study would not be an experiment as it was not a pre-specified study performed under standardized conditions. Retrospective studies include case-control studies (see section 5.10).

5.2.3 Longitudinal or cross-sectional

Longitudinal studies are those which investigate changes over time, possibly in relation to an intervention. Observations are taken on more than one occasion, although they may not all be used in the analysis. Clinical trials are longitudinal because we are interested in the effect of treatment commencing at one time point on outcome at a later time. Cross-sectional studies are those in which individuals are observed only once. Most surveys are cross-sectional, as are studies to construct reference ranges. Observational studies may be longitudinal or cross-sectional, but experiments are usually longitudinal.

There is also the 'pseudo-longitudinal' study in which each subject is seen at only one time, but the data are used to describe changes over time. Examples are studies to derive cross-sectional growth charts for children and studies of hormone levels during the menstrual cycle (see section 5.13).

5.2.4 Summary of inter-relationships

Figure 5.1 summarizes the most likely possible combinations of design features. There is a clear distinction between experimental studies, which are nearly all prospective and longitudinal, and observational studies, which can be either retrospective or prospective and also either cross-sectional or longitudinal. For this reason experiments and observational studies are considered separately later in this chapter.

Figure 5.1 Types of research design.

It is possible to construct more complex categorizations of research designs (Bailar et al., 1984), but Figure 5.1 describes the main features of most research studies.

So far discussion of design has related to broad issues. The following sections look at design in more detail with particular emphasis on two important aspects of statistical inference from the sample to the population - the representativeness of the sample and the interpretation of any associations found.

5.2.5 Control

Whatever the experiment, it is essential to have a comparison, or control, group to which the experimental procedure is not applied. It is not usually scientifically or ethically acceptable to say 'Let's try this new treatment on some patients and see what happens'. It is far better to have a control group who are treated normally (or in some way differently), against which comparisons can be made. If we wish to evaluate the benefits of mothers counting fetal movements in pregnancy we should have a concurrent control group of mothers who do not count movements. This is a key component of the evaluation of new therapies or procedures in medicine.

Controls are also advisable in observational studies. If we ask users of visual display terminals (VDTs) if they get eye strain or backache, we should also ask the same questions of a group of comparable employees who do not use VDTs.

In each case the presence of the control group strengthens the inferences that may be made from the results of the study. However, as I shall discuss below, the choice of suitable controls in observational studies is not easy.

5.3 SOURCES OF VARIATION

Chapters 3 and 4 both began with comments about the importance of variability to the statistical approach. Variability in behaviour or response to some stimulus, be it tobacco or an antibiotic drug, is the norm. As noted in Chapter 3, some sources of variability may be known, or suspected, but much remains unexplained. For example, we know several variables that affect birth weight, such as length of gestation, fetal sex, parity, maternal smoking, height above sea level, and so on, but statistical models incorporating such information explain only about a quarter of the variability in birth weight. While there are undoubtedly other factors not yet identified that contribute to the variability, it is most unlikely that any important factors remain unidentified. The bulk of the observed variability must therefore be considered unexplainable, which we call random variation. There is considerable random variation in most clinical measurements. For some, such as body temperature, there is relatively little variation, but for others, such as birth weight, blood pressure, or many serum constituents, there is enormous variation. When we are designing a study to compare groups with respect to levels of some clinical measurement, this natural variability must be borne in mind. We can think of this random variability as 'background noise', against which we are trying to detect some effect, or 'signal', of interest. There is a good analogy here with the concept of the 'signal to noise' ratio used in other fields. If the outcome measurement is highly variable we will need a larger study to be able to detect a systematic effect of interest. Another possible design to consider is that in which we remove between subject variation by studying within subject changes from a baseline level.

Further, individuals will exhibit similar variation in other characteristics not directly being studied but which might affect the variables of interest. Many of the principles of experimental design are aimed at trying to control variation that we are not interested in, so that we can focus our attention on the variability that we are interested in. Two consequences of this general variability relevant to the design of studies are:

  1. Care is needed to make samples representative of the population.
  2. In comparative studies care is needed in making groups similar with respect to known sources of variation.

In addition, we need to bear in mind that when the measurement of interest is highly variable, large samples are needed to get reliable results.

These issues are discussed below, firstly for experimental studies and then for observational studies. However, methods for calculating the appropriate sample size for comparative studies are given in Chapter 15, as they are most often used when designing clinical trials.

Before examining different types of experimental design in detail it should help to consider a real study that illustrates many of the issues.


5.4 AN EXPERIMENT: IS THE BLOOD PRESSURE THE SAME IN BOTH ARMS?

Blood pressure is a particularly variable measurement. Not only does it vary considerably between individuals, for which we have partial explanation, but it varies greatly over time for each individual. There is marked variation over 24 hours (circadian variation) as well as day-to-day variation. In addition, blood pressure is difficult to measure. In recent years new technology has been developed to allow continuous recording of blood pressure via an indwelling catheter in the arm linked to a small tape recorder. Ambulatory blood pressure monitoring is more informative, as it gives data for 24 hours, and also potentially more accurate as it measures blood pressure directly and without observer error. Many people regard this intra-arterial technique as the 'gold standard' against which to judge new methods, in particular indirect (i.e. non-invasive) ambulatory recorders. Because of the variability referred to it is important to take simultaneous measurements using the two devices, and thus to use both arms. The question then arises as to whether there might be any systematic difference in blood pressure between the left and right arms.

A study to answer this question was described by Gould et al. (1985). The design was as follows. The equipment used to measure the blood pressure was a 'random zero' sphygmomanometer, a machine designed to remove observer bias. (To the reading observed must be added another 'random' quantity not known until afterwards.) A cuff was attached to each arm, and both were connected to the same sphygmomanometer. An electric air pump was used to equalize the pressure to the two cuffs. Clearly it was necessary to have two observers - one to each arm. Despite the use of a special sphygmomanometer it was important that the observers did not measure only one arm in case there was a systematic difference between the observers. Thus each observer had to take half of the observations on the left arm and half on the right, and it was felt sensible (although it was not essential) for each observer to measure both arms of each patient. A similar argument applied to the two cuffs, which might have been slightly different. Thus each cuff had to be applied equally to each arm and again this was carried out for each patient. In view of the known variability of blood pressure it was decided that each observer would take two measurements using each cuff on each arm of each patient, giving 16 measurements per patient. Finally, there might have been a tendency for a patient's blood pressure to change systematically during the series of measurements. Thus the order in which the cuffs were applied to the arms and the order in which the observers measured the two arms was varied using randomization. A detailed explanation of randomization is given in section 5.7. There was no communication of results between observers.
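The balance of this design can be made concrete by listing every combination of factors for one patient. The sketch below is illustrative only: the labels for observers, cuffs and replicates are hypothetical, and it enumerates the combinations without attempting to reproduce the randomized ordering used in the study.

```python
from collections import Counter
from itertools import product

# Hypothetical labels for the four factors described in the text
observers = ("observer 1", "observer 2")
cuffs = ("cuff A", "cuff B")
arms = ("left", "right")
replicates = (1, 2)

# Every observer uses every cuff on every arm twice
schedule = list(product(observers, cuffs, arms, replicates))

# Check the balance: how many readings fall in each arm and each observer-arm pair
per_arm = Counter(arm for _, _, arm, _ in schedule)
per_observer_arm = Counter((obs, arm) for obs, _, arm, _ in schedule)

print("measurements per patient:", len(schedule))
print("per arm:", dict(per_arm))
print("per observer and arm:", dict(per_observer_arm))
```

Because every observer and every cuff contributes equally to each arm, a systematic difference between observers or between cuffs cannot masquerade as a difference between arms.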

The study was carried out on 91 subjects with essential hypertension.

The above design was used to try to get as pure a comparison as possible of the blood pressure in the left and right arms. In addition the circumference of each arm was recorded, and a record was kept of the order in which the measurements were taken for each subject. This study illustrates many features of the design of an experiment, some of which will be discussed in more detail in section 5.5:

Number of observers It was necessary to have two observers per subject, but it is often a good idea to have more than one observer even when it is not necessary, as it allows the differences between observers to be quantified (see section 14.2).

Replicated measurements It is desirable to take more than one reading in each combination of experimental conditions as it gives greater precision for estimating the effects of interest. The replicates need to be independent readings, however. They were independent in the arm comparison study because the type of machine used meant that the observers did not know what their previous measurement was.

平衡设计 各实验因素组合下不必一定采集相同数量的观察值,但如果一切平衡,如上述研究所示,分析会简单得多。
Balanced design It is not essential that the same number of observations is taken for each combination of experimental factors, but if everything is balanced, as in the above study, the analysis is very much simpler.

随机化 每位患者的两臂分配观察者和袖带的顺序是随机确定的。随机化是实验设计的关键要素之一。
Randomization The order in which the observers and cuffs were allocated to the two arms for each patient was determined at random. Randomiza­ tion is one of the key elements of experimental design.

协变量 有时存在非实验特征(协变量)需要记录,因为它们可能影响结果。虽然这些特征可能在不同观察间变化,比如环境温度,但也可能只在不同受试者间变化,比如年龄。在本研究中,臂围被认为是一个可能的协变量,因为它影响袖带的贴合度。臂围介于上述两种情况之间,在同一受试者内(即两只手臂之间)会变化,但不会在不同观察间变化。另一个潜在的协变量是观察顺序。设计采用随机和平衡方法,因为预期重复测量时记录的血压会下降。然而,在分析中考虑测量顺序可以提高精确度。
Covariates Sometimes there are non- experimental features (covariates) that need to be recorded as they might have affected the results. While they may vary from observation to observation, such as ambient tem­ perature, they may vary only from subject to subject, such as age. In this study arm circumference was considered to be a possible covariate as it affects the fit of the cuff. Arm circumference is intermediate between the two examples given, varying within subject (i.e. between arms) but not from observation to observation. Another potential covariate was the order of observations. The design was randomized and balanced because it was anticipated that recorded blood pressure would fall over repeated measure­ ments. However, it is possible to take account of the order of measure­ ments in the analysis to improve precision.

样本量 采用了较大样本量,以提供两臂间差异的精确估计。
Sample size A large sample was taken to provide a precise estimate of the difference between the arms.

5.5 THE DESIGN OF EXPERIMENTS

An experiment should be designed to answer the question of interest as simply and clearly as possible. It is important to consider the way the data will be analysed when designing an experiment as this can save complications later. This chapter considers experiments in general. Chapter 15 considers clinical trials in depth, as there are many special issues involved.

In this section I discuss some of the more important aspects to consider when designing an experiment.

5.5.1 Bias

Any study, whether experimental or observational, will be set up to answer one or more specific questions. The reliability of the results, and thus the interpretation of the findings, is crucial. An experiment provides the best opportunity to get at the truth, but there are several precautions that should be taken to ensure that the results are not biased. For example, in a comparative experiment, such as the arm comparison study, it is important that the groups of observations being compared are comparable in all aspects other than that being manipulated by the experimenter. Several of the design features of the arm comparison study were included for this reason.

Bias can occur through structural deficiencies in a study. For example, if one observer had taken all measurements on the left arm and the other all those on the right arm, the between-arm differences would have been inseparable from any between-observer differences, an effect called confounding. In fact, that study was carried out expressly to see if there would be confounding when different machines were compared one to an arm. Making sure that the different observer-arm-cuff combinations were used equally in the 1st, 2nd, 3rd and 4th orders is another example of avoiding bias.

5.5.2 Randomization

An important possible source of bias is the way in which subjects vary in features that are not part of the design. For example, if we had measured blood pressure in the left arm only in one group of patients and in the right arm only in another group, then the average difference observed between left and right arms could be affected by differences between the groups with respect to any variable related to blood pressure, such as age. Clearly it is better to use both arms in the same patients, but in most studies the procedures or treatments cannot be given to the same individuals. The usual approach here is to allocate treatments to patients at random. As described in section 5.7, the word random has a specific statistical meaning. Random allocation is one of the fundamental principles of experimental design. Another device is to find pairs of subjects with closely similar characteristics and allocate treatments to the matched pair at random. Matching is discussed in section 5.8.

In the arm comparison study the order in which the observers measured the left and right arms and the order of use of the two cuffs were randomized. There was no specific reason to expect a bias from, for example, observer 1 always starting on the left arm, but random ordering was used as a safeguard against possible subtle unknown effects.

5.5.3 Blinding

Bias can also occur through subconscious effects. For example, observers' judgements may be affected by knowing the treatment that a subject is getting, or by knowledge of a previous measurement for that subject. The latter problem was avoided in the arm comparison study by the choice of blood pressure measuring machine. The former problem is especially relevant in clinical trials, where it is desirable to keep both patients and assessors in ignorance of the treatment given, a procedure known as blinding (see Chapter 15).

5.5.4 Replication

For measurements that are highly variable or difficult to measure accurately it may be useful to take more than one measurement on each individual. These replicates can be treated in the analysis as separate observations, which may make the analysis more complicated but gives greater potential to detect effects of interest. This analysis is only valid if the replicates are independent, which is often not the case if the observer knows what measurement they obtained the first time.

More often the average of the replicates is used in the analysis. This latter approach may mirror clinical practice - some 'noisy' variables such as blood pressure, peak expiratory flow rate, and ultrasound measurements are usually repeated.

5.5.5 Sample selection

It is always desirable for the sample in a study to be representative of the population of interest, but this is not as important in experiments as in observational studies. For example, it is unlikely that the choice of the sample for the arm difference study would have affected the results. It is much more important to ensure that the subgroups being compared are as similar as possible.

Although in principle representative samples are best obtained by random selection from the population, this ideal is virtually never met in practice. However, the sample should be chosen to be as similar as possible to the relevant population, so it is essential to be able to describe just how the sample was chosen.

These considerations are probably irrelevant for most animal experiments.

5.5.6 Sample size

Another way of combating variability is to increase the sample size. Larger samples enable us to evaluate effects of interest more precisely. The determination of an appropriate sample size is most common in clinical trials and section 15.3 describes formal methods for choosing an appropriate sample size in comparative studies. Similar principles apply to all studies, but the methods can be complicated so expert assistance is required.

5.6 THE STRUCTURE OF AN EXPERIMENT

In a designed experiment such as the arm comparison study there may be several conditions (called factors) being controlled by the investigator. It may be helpful to draw a diagram to show the structure of the design. As well as clarifying the design the diagram will show how the data should be analysed.

A simple example is an experiment to compare three separate groups of subjects given different analgesics to combat migraine. Figure 5.2 shows the simple structure of this design. Each x denotes an observation. In this design there is no need for the three groups to be of equal size but in more complicated designs equal sizes are highly desirable. If the study design were changed so that each subject received all three analgesics in random order, the design would be as shown in Figure 5.3. Here observations on the same subject are connected.

Figure 5.2 Structure of a study to compare three groups of subjects receiving analgesics A, B or C. Each x indicates one subject.

A study may combine both these features, so that subjects are examined more than once but different groups of subjects are treated differently. For example, we may wish to compare subjects' weights before and after different diets; Figure 5.4 shows the appropriate design. Figures 5.2 to 5.4 illustrate the important distinction between within subject and between subject comparisons.

The study comparing blood pressure in the left and right arms was more complicated. There were three factors - arms, observers and cuffs - and two measurements (replicates) were taken for each combination. The design of this study, which is shown in Figure 5.5, is known as a factorial design as all combinations of factors are used.
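The factorial structure can be made concrete by enumerating every combination of the factors. The following is a minimal sketch in Python; the labels such as "observer 1" and "cuff 1" are illustrative names, not taken from the study itself.

```python
from itertools import product

# Three factors, each at two levels, plus two replicates per combination.
# The level labels are illustrative placeholders.
arms = ["left", "right"]
observers = ["observer 1", "observer 2"]
cuffs = ["cuff 1", "cuff 2"]
replicates = [1, 2]

# A factorial design uses every combination of the factor levels.
design = list(product(arms, observers, cuffs, replicates))

for arm, observer, cuff, replicate in design:
    print(arm, observer, cuff, replicate)

print(len(design), "measurements per patient")
```

The count agrees with the 16 measurements per patient described for the arm comparison study, and each arm, observer and cuff appears in exactly half of them, which is what makes the design balanced.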

It is not possible to say what the best design is in any given circumstance. The choice of factors to control, which factors are between subject and which within, and how many observations to take for each subject is difficult, and it will often take much thought to arrive at a satisfactory design. Expert statistical help is particularly valuable at this stage. Any weaknesses in the design cannot be rectified later.


Figure 5.3 Structure of a study to compare three treatments in one group of subjects. Lines join observations on the same subject, which are made in random order.


Figure 5.4 Structure of a study to compare two groups measured before and after treatment.


Figure 5.5 Structure of the study to compare blood pressure in the left and right arms - a three way factorial design.

5.7 RANDOM ALLOCATION

There have been several mentions of random allocation earlier in this chapter. The rationale for and methods of randomization in experimental studies are discussed in this section.

There are two main reasons for using randomization. The first reason is to prevent bias. As noted earlier, we want to compare treatments between groups which do not differ in any systematic way. If subjects receive treatments chosen by the investigator (or indeed the subject) there is the likelihood of bias arising - usually subconscious but occasionally intentional. We can avoid this possibility by allocating treatments to subjects at random. There is further discussion of this issue with regard to clinical trials in section 15.2.2.

Bias can also arise through unknown effects. For example, when two or more treatments (or experimental conditions) are used for each subject it is advisable to randomize the order in which they are applied to each subject in case there is any unknown bias associated with time or the order of measurements. This argument was behind the randomization of the order of measurements in the arm comparison study.

The other reason for randomizing is that statistical theory is based on the idea of random sampling. In a study with random allocation the differences between treatment groups behave like the differences between random samples. As noted in Chapter 4, we know how random samples are expected to behave, and so can compare the observations with expectation, for example assuming that the treatments are equally effective.

5.7.1 Simple randomization

It is not always appreciated that random does not mean the same as haphazard. By random allocation we mean that each patient has a known chance, usually an equal chance, of being given each treatment, but the treatment to be given cannot be predicted. Thus alternately allocating two treatments to a series of patients is not random allocation. The simplest method of random allocation is tossing a coin - heads is treatment A, tails is treatment B. An equivalent method is to use a table of random numbers, such as that in Table B13. In these tables each number occurs equally often, and the ordering is random, and so completely unpredictable. Another option is to use a random number generator on a computer.

The first step is to decide the correspondence between the random numbers and the different experimental groups. For example, if we wish to allocate equally two treatments to subjects using Table B13 we could take odd numbers to indicate one treatment and even numbers to indicate the other. We must then choose a place to start, and this can be done using a pin or some equally arbitrary method. In addition we can choose the direction in which to read the table.

Suppose that the first two digit numbers in the table from our starting place are

12 19 20 52 81 30 74 93 02 67 41 50, etc.

If we take odd numbers for treatment A and even numbers for treatment B, then these numbers indicate the sequence

BABBABBABAAB

for the first 12 subjects. Alternatively we could take each digit on its own, to give

ABAABBABBAABABAABBBABAAB

for the first 24 subjects. A third approach would be to take numbers 00 to 49 for A and 50 to 99 for B, and there are countless other possible strategies. It makes no difference which is used.
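The three strategies just described can be checked with a short script. This is a sketch only: the function names `odd_even` and `halves` are made up for illustration, and the digits are the twelve two-digit numbers quoted above rather than values read from Table B13.

```python
# The twelve two-digit numbers quoted in the text.
pairs = "12 19 20 52 81 30 74 93 02 67 41 50".split()

def odd_even(number):
    """Odd numbers indicate treatment A, even numbers treatment B."""
    return "A" if int(number) % 2 == 1 else "B"

def halves(number):
    """Numbers 00 to 49 indicate treatment A, 50 to 99 treatment B."""
    return "A" if int(number) <= 49 else "B"

# Strategy 1: one allocation per two-digit number.
by_pairs = "".join(odd_even(p) for p in pairs)

# Strategy 2: one allocation per single digit.
by_digits = "".join(odd_even(d) for d in "".join(pairs))

# Strategy 3: split the range 00 to 99 in half.
by_halves = "".join(halves(p) for p in pairs)

print(by_pairs)   # BABBABBABAAB
print(by_digits)  # ABAABBABBAABABAABBBABAAB
print(by_halves)  # AAABBABBABAB
```

The first two outputs reproduce the sequences given in the text. The third shows that a different rule gives a different, equally valid, sequence from the same digits.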

We can easily generalize the last approach to situations with more than two treatments or experimental conditions. For example, we could use the following scheme for three groups:

01 to 33: treatment A
34 to 66: treatment B
67 to 99: treatment C
00: ignored

and similarly for other designs. Notice that at any point in the sequence the numbers of patients allocated to each treatment will probably differ. We sometimes wish to keep the numbers in each group very close at all times, which we can achieve by block randomization. Further, with simple randomization the distribution of the characteristics of the subjects in each group is left completely to chance. We often know or suspect that some subjects will behave differently, for example they may have different prognoses, and so it is desirable to keep the numbers within these classes similar in the different treatment groups. We can achieve this by stratified randomization or minimization. These techniques are all described below.

Clearly it is very easy to adapt the above method to give a weighted randomization, leading to unequal numbers in the different groups. For example, we could allocate treatments A and B in proportions 2 to 1 by using 01 to 66 for A and 67 to 99 for B.
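The three-group scheme and the 2 to 1 weighted scheme can be written as simple lookup rules. The sketch below uses hypothetical function names and assumes that 00 is skipped in the weighted scheme as well, which the text implies but does not state.

```python
def three_groups(number):
    """01-33 -> A, 34-66 -> B, 67-99 -> C, 00 -> ignored (None)."""
    if number == 0:
        return None
    if number <= 33:
        return "A"
    if number <= 66:
        return "B"
    return "C"

def two_to_one(number):
    """01-66 -> A, 67-99 -> B; 00 is assumed to be ignored (None)."""
    if number == 0:
        return None
    return "A" if number <= 66 else "B"

# The two-digit numbers used earlier in the section.
numbers = [12, 19, 20, 52, 81, 30, 74, 93, 2, 67, 41, 50]

three_way = [three_groups(n) for n in numbers]
weighted = [two_to_one(n) for n in numbers]

print("".join(t for t in three_way if t is not None))
print("".join(t for t in weighted if t is not None))

# Over the whole range 00 to 99 each rule covers the intended share.
share = [two_to_one(n) for n in range(100)]
print(share.count("A"), share.count("B"))  # 66 and 33, a 2 to 1 ratio
```

Ignoring 00 keeps the ranges exactly equal (or exactly 2 to 1); any number that maps to `None` is simply skipped and the next number in the table is used.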

5.7.2 Block (or restricted) randomization

Block (or restricted) randomization is used to keep the numbers of subjects in the different groups closely balanced at all times. For example, if we consider subjects in blocks of four at a time, there are six ways in which we can allocate treatments so that two subjects get A and two get B:

1 AABB    4 BBAA
2 ABAB    5 BABA
3 ABBA    6 BAAB

If we use combinations of only these six ways of allocating treatments then the numbers in the two groups at any time can never differ by more than two, and they will usually be the same or one apart. We choose blocks at random to create the allocation sequence. Using the previous random sequence beginning

121920528130749302674150

we can omit those numbers outside the range 1 to 6 to get

12122134326415

from which we can construct the block allocation sequence

AABB ABAB AABB ABAB ABAB AABB ABBA BBAA

and so on. Notice the apparently non-random beginning of the sequence - 121221 - in which only two of the six numbers appear. Lists of random numbers always throw up peculiar sequences like this one - they would not be random if they did not. Inspection of Table B13 shows many such sequences.
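The block method lends itself to a small script. The sketch below takes the block numbering from the list of six blocks above, builds the allocation from the opening block numbers 1, 2, 1, 2, 2, 1, and tracks how far apart the two groups ever get. The helper names and the seed are hypothetical, chosen only for illustration.

```python
import random
from itertools import permutations

# Block numbering as listed in the text.
BLOCKS = {1: "AABB", 2: "ABAB", 3: "ABBA", 4: "BBAA", 5: "BABA", 6: "BAAB"}

# Sanity check: these are all the distinct orderings of two As and two Bs.
assert set(BLOCKS.values()) == {"".join(p) for p in permutations("AABB")}

def allocation_from(block_numbers):
    """Concatenate the blocks chosen by a list of numbers from 1 to 6."""
    return "".join(BLOCKS[n] for n in block_numbers)

def max_imbalance(sequence):
    """Largest difference between the group sizes at any point."""
    diff = worst = 0
    for treatment in sequence:
        diff += 1 if treatment == "A" else -1
        worst = max(worst, abs(diff))
    return worst

# The opening block numbers discussed in the text.
sequence = allocation_from([1, 2, 1, 2, 2, 1])
print(sequence)                 # AABBABABAABBABABABABAABB
print(max_imbalance(sequence))  # 2

# With a computer, blocks can be chosen by a random number generator.
rng = random.Random(12345)  # arbitrary seed, for reproducibility only
computer_sequence = allocation_from([rng.randint(1, 6) for _ in range(10)])
print(max_imbalance(computer_sequence) <= 2)  # True
```

Because every block contains two of each treatment, the groups are exactly equal after each complete block, so the imbalance can never exceed two whichever blocks are drawn.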

Randomized blocks can be of any size, but using a multiple of the number of treatments is more logical. Large blocks are best avoided as they control balance less well. In clinical trials it is highly desirable for the randomization sequence to be kept hidden from those actually giving the treatments. This is often achieved by creating a pile of opaque numbered sealed envelopes each containing the allocation for one patient. Even so, with the knowledge that restricted randomization is being used, it is possible to deduce in advance the treatment to be given to every fourth patient. For this reason it is better for the users of the random numbers not to know how the sequence was constructed, and it may also be desirable to vary the block length, again at random, perhaps using a mixture of blocks of size 2, 4, or 6. A similar approach is used when there are more than two treatments. For example, blocks of size 3, 6, or 9 can be used for three treatments. Obviously these considerations do not apply to experiments on animals or laboratory experiments on human samples.

There is further discussion in section 15.2 of the problems associated with treatment allocation in clinical trials.

5.7.3 Stratified randomization

While simple randomization removes bias from the allocation procedure, it does not guarantee, for example, that the subjects in each group have similar age distributions. Indeed in small studies it is highly likely that some chance imbalance will occur, which might complicate the interpretation of results. Even in studies with over 100 subjects there may be some substantial variations by chance, especially for characteristics that are quite rare. In many clinical studies it is known beforehand that subgroups of patients are expected to respond differently to treatment. Here it is advisable to ensure that the subjects receiving each treatment have similar characteristics.

We can use stratified randomization to achieve approximate balance of important characteristics without sacrificing the advantages of randomization. The method is to produce a separate block randomization list for each subgroup (stratum). For example, in a study to compare two alternative treatments for breast cancer it would be important to stratify by menopausal status. Two separate lists of random numbers should be obtained, from which two separate piles of sealed envelopes can be prepared, for premenopausal and postmenopausal women. It is essential that stratified treatment allocation is based on block randomization within each stratum rather than simple randomization; otherwise there will be no control of balance of treatments within strata, and so the object of stratification will be defeated.
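As a rough illustration, a separate block randomization list can be generated for each stratum. The sketch below assumes blocks of four, two treatments, five blocks per stratum and an arbitrary seed; none of these values come from the text, and `block_list` is a hypothetical helper.

```python
import random
from itertools import permutations

# All distinct blocks of four containing two As and two Bs.
BLOCKS = sorted({"".join(p) for p in permutations("AABB")})

def block_list(n_blocks, rng):
    """Build one allocation list from randomly chosen blocks."""
    return "".join(rng.choice(BLOCKS) for _ in range(n_blocks))

rng = random.Random(2024)  # arbitrary seed, for reproducibility only

# One independent list per stratum, as for the two piles of envelopes.
strata = ("premenopausal", "postmenopausal")
lists = {stratum: block_list(5, rng) for stratum in strata}

for stratum, allocations in lists.items():
    print(stratum, allocations,
          "A:", allocations.count("A"), "B:", allocations.count("B"))
```

Each patient is allocated the next unused entry in the list for her own stratum, so treatment numbers stay balanced within each stratum as well as overall.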

Stratified randomization can be extended to two or more stratifying variables. For example, we might wish to extend the stratification in the breast cancer trial to tumour size and number of positive nodes. We have to produce a separate randomization list for each combination of categories. If we had two tumour size groups and three groups for node involvement, as well as the two categories of menopausal status, then we have 2 × 3 × 2 = 12 strata, which may exceed the limit of what is practical. There is the further problem with multiple strata that some of the combinations of categories may be rare, so that the treatment balance expected from the use of block randomization does not occur.

Some thought should be given to which variables are used for stratification, restricting the choice to variables known to be prognostically important. Many trials stratify using age and sex. While age is frequently known to be prognostic, sex is often not prognostic and need not be used for stratification.

In a multicentre study the patients within each centre will need to be randomized separately unless there is a central coordinated randomizing service. Thus 'centre' is a stratifying variable, and there may be other stratifying variables as well.

In small studies it is not practical to stratify on more than one or perhaps two variables, as the number of strata can quickly approach the number of subjects. When it is really important to achieve close similarity between treatment groups for several variables minimization can be used (see section 5.8).

5.7.4 Other uses of randomization

In some studies it is either impossible or impractical to allocate treatments to individual subjects. Suppose that we wish to evaluate the effectiveness of a health education campaign on television or in the newspapers to increase awareness of the dangers of drugs, or indeed to change behaviour. We cannot target individuals at random, but rather we can randomly assign whole areas to receive different media coverage. With a large number of small areas this cluster randomization should give reliable results, but with a small number of very large areas, as would be likely in the example given, there are problems in ensuring the comparability of the areas. Here it is valuable to obtain baseline data before the study starts so that changes within areas over the time of the study can be compared. Other clusters sometimes used in experimental research are schools, hospitals and families.

As with treatment comparisons on individuals, randomized studies on areas will give more reliable results than non-randomized studies, but randomization is often impossible. Much of the controversy over the possible association between the fluoridation of drinking water and cancer in the United States was due to the different characteristics of areas which did or did not have fluoride.

Randomization can also be used in other ways in experiments. In the arm comparison study the order in which the two observers and two cuffs were used on each arm was randomized in case there was some systematic order effect. It is a good idea to use balanced randomization in situations where there is the possibility of some systematic unwanted effect (that is, a bias). No harm will be done if it turns out that there was no such effect.

It is also advisable to use randomization in animal experiments (Gart et al., 1986). For example, if mice are to be given one of two or more different treatments it is best to select them one at a time and use a random sequence to determine the treatment. There are likely to be size differences between those animals pulled out first from the cage and those left to the end (Festing, 1981). There may also be systematic differences between animals in different cages, so that each cage should contain some animals given each treatment.

Likewise randomization has a role in laboratory experiments, such as when analysing samples that have been treated differently (e.g. by irradiation). If the samples are analysed in a continuous process, such as when using a Coulter counter to measure haemoglobin and white cell counts in samples of whole blood, then the order of analysis should preferably be randomized in relation to the differently treated samples.
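One simple way to randomize the order of analysis is to shuffle the list of samples before they are run. The sketch below assumes six irradiated and six control samples and an arbitrary seed; the sample labels are invented for illustration.

```python
import random

# Hypothetical samples: six treated by irradiation and six untreated.
samples = [("irradiated", i) for i in range(1, 7)]
samples += [("control", i) for i in range(1, 7)]

rng = random.Random(7)  # arbitrary seed, for reproducibility only

# Shuffle a copy so that the original list is left unchanged.
run_order = samples[:]
rng.shuffle(run_order)

for position, (treatment, sample_id) in enumerate(run_order, start=1):
    print(position, treatment, sample_id)
```

Shuffling mixes the two types of sample through the run, so any drift in the instrument over time is not confounded with the treatment.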

In some experiments samples are analysed in batches and there are physical constraints on the number that can be dealt with in one go. It is advisable to have equal numbers of each type of sample in each batch. Further, if there is the possibility of systematic differences between the different locations, then the positions of the samples should also be randomized. For example, different types of sample can be randomly allocated to the numbered wells in a plate.

5.8 MINIMIZATION

The only form of allocation that is an acceptable alternative to randomization is minimization, which is a clever method of ensuring excellent balance between the groups for several prognostic factors, even in small samples. It is based on the idea that the next patient to enter the trial is given, with probability greater than 0.5, whichever treatment would minimize the overall imbalance between the groups at that stage of the trial. Often the probability is taken as 1, but a value greater than, say, 0.75 should achieve much the same result with the advantages of a random component. Details of the method are given in section 15.2.3, as the technique is mainly used in clinical trials.
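The idea can be sketched in a few lines of code. The details of the method belong to section 15.2.3, so the version below rests on stated assumptions: imbalance is measured by counting, for each treatment, the patients already allocated who share each of the new patient's factor levels, and the treatment with the smaller total is given with probability `p`. The function name, the data layout and the default `p = 0.8` are all hypothetical.

```python
import random

def minimization_choice(new_patient, allocated, treatments=("A", "B"),
                        p=0.8, rng=random):
    """Choose a treatment for new_patient (a dict of factor -> level).

    allocated is a list of (treatment, factors) pairs for earlier patients.
    Assumed imbalance measure: for each treatment, the number of matches
    between earlier patients' factor levels and the new patient's levels.
    """
    totals = {}
    for t in treatments:
        totals[t] = sum(
            1
            for treatment, factors in allocated
            if treatment == t
            for name, level in new_patient.items()
            if factors.get(name) == level
        )
    smallest = min(totals.values())
    preferred = [t for t in treatments if totals[t] == smallest]
    if len(preferred) == len(treatments):
        # Complete tie: fall back on simple randomization.
        return rng.choice(list(treatments))
    if rng.random() < p:
        return rng.choice(preferred)
    others = [t for t in treatments if t not in preferred]
    return rng.choice(others)

# Three earlier patients and one new patient (illustrative data only).
allocated = [
    ("A", {"age": "old", "sex": "F"}),
    ("A", {"age": "old", "sex": "M"}),
    ("B", {"age": "young", "sex": "F"}),
]
new_patient = {"age": "old", "sex": "F"}

choice = minimization_choice(new_patient, allocated, p=1.0,
                             rng=random.Random(0))
print(choice)  # B
```

Here treatment A already has three matches with the new patient's characteristics and B has one, so with `p = 1` the new patient receives B. A value of `p` below 1 keeps a random element in the allocation.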

5.9 OBSERVATIONAL STUDIES

As shown in Figure 5.1, observational studies can take different forms. Many studies are carried out to investigate possible associations between various factors and the development of a particular disease or condition. Examples are studies of the relation between passive smoking and lung cancer, the use of visual display terminals and miscarriage, and alcohol consumption and suicide. There is no logical difference between comparing the outcome of two groups of patients given alternative treatments and comparing the outcome of groups receiving different exposures. In general, however, areas of epidemiological research such as those listed above are not amenable to being investigated by randomized trials. We cannot randomize individuals to smoke or not to smoke nor to work in particular jobs, and other factors such as age and race are not controllable by the individual. We must use observational studies, therefore, to study factors or exposures which cannot be controlled by the investigators. Nevertheless, as stated by Gray-Donald and Kramer (1988), 'the goal of an observational study should be to arrive at the same conclusions that would have been obtained by an experimental trial'.

There are two main types of observational study that are used to investigate causal factors - the case-control study and the cohort study. Figure 5.6 indicates the basic structure of these designs. In a retrospective case-control study a number of subjects with the disease in question (the cases) are identified along with some unaffected subjects (controls). The past history of these groups in relation to the exposure(s) of interest is then compared. In contrast, in a prospective cohort study a group of subjects is identified and followed prospectively, perhaps for many years, and their subsequent medical history recorded. The cohort may be subdivided at the outset into groups with different characteristics, or the study may be used to investigate which subjects go on to develop a particular disease. (There is also the historical cohort study, in which a past cohort is identified, and their experience up to the present is obtained. Few studies like this are carried out as the necessary data are rarely available.) Also shown in Figure 5.6 is the cross-sectional study, in which subjects are investigated on one occasion only. The advantages and disadvantages of the retrospective case-control study, the prospective cohort study and the cross-sectional study are described in the next three sections.

  • Cohort study: disease experience is collected prospectively
  • Case-control study: past experience of cases and controls is recalled
  • Cross-sectional study: past experience and current disease status are collected at the same time

Figure 5.6 Basic structure of the case-control study, the cohort study and the cross-sectional study.

5.10 THE CASE-CONTROL STUDY

As shown in Figure 5.6, in the case-control study we identify a group of subjects (cases) with the disease or condition of interest, say lung cancer, and an unaffected group (controls), and compare their past exposure to one or more factors of interest, such as consumption of carrots. If the cases report greater exposure than the controls we may infer that exposure is causally related to the disease of interest, for example that consumption of carrots affects the risk of developing lung cancer.

The prime advantages of the case-control approach are practical: it is relatively simple, and thus quick and cheap. The case-control design is also valuable when the condition of interest is very rare. The disadvantages of this design are important, however, and relate to possible biases in the comparison of cases and controls. Sackett (1979) identified as many as 35 different biases that can occur with case-control studies; some of the main ones are described below.

5.10.1 Selection of controls

The main difficulty with the case-control study is the selection of an appropriate control group. If we follow the analogy with the randomized clinical trial, we want the controls to be as similar as possible to the cases, except that they do not have the disease being investigated. Obtaining such a group, however, is not straightforward. Subjects who do not have the outcome of interest may well differ in other ways from the cases, and in particular may be atypical with regard to the exposure of interest. For example, when the cases are hospital patients with a particular condition it is common to take as controls patients in the same hospital(s) with different conditions. Patients in hospital may be expected to have other conditions that are also affected by the exposure of interest. For example, in a study of lung cancer and smoking, use of hospital controls may well lead to an underestimate of the relation because many other medical conditions are related to smoking. This bias would not appear so likely in a study of lung cancer and consumption of carrots (Pisani et al., 1986), but diet may be affected by or may lead to other medical conditions.

In particular, problems can arise from different hospital admission rates among four groups: exposed and unexposed cases and exposed and unexposed controls. This bias was postulated on theoretical grounds by Berkson in 1946, but was not demonstrated empirically until 1978 (Roberts et al., 1978).
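
The effect of unequal admission rates can be seen with a little arithmetic. The sketch below is only an illustration under stated assumptions: the population counts and the four admission probabilities are invented, and the exposure is deliberately set to be unrelated to the disease in the source population. It compares the odds ratio in the population with the odds ratio among those admitted to hospital.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table of exposure by case/control status."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def table_odds_ratio(counts):
    """Odds ratio for a dict keyed by (exposure, status)."""
    return odds_ratio(
        counts[("exposed", "case")],
        counts[("unexposed", "case")],
        counts[("exposed", "control")],
        counts[("unexposed", "control")],
    )


# Hypothetical source population: exposure and disease are unrelated (OR = 1).
population = {
    ("exposed", "case"): 2_000,
    ("unexposed", "case"): 8_000,
    ("exposed", "control"): 20_000,
    ("unexposed", "control"): 80_000,
}

# Hypothetical admission probabilities. Exposed non-cases are admitted more
# often than unexposed non-cases, e.g. for other exposure-related conditions.
admission_rate = {
    ("exposed", "case"): 0.50,
    ("unexposed", "case"): 0.40,
    ("exposed", "control"): 0.10,
    ("unexposed", "control"): 0.05,
}

# Expected numbers of each group found in hospital.
hospital = {group: population[group] * admission_rate[group] for group in population}

population_or = table_odds_ratio(population)
hospital_or = table_odds_ratio(hospital)

print(f"Odds ratio in the population:     {population_or:.3f}")
print(f"Odds ratio among hospital patients: {hospital_or:.3f}")
```

With these made-up rates the hospital-based comparison suggests a protective effect where none exists. Other choices of admission rates can push the odds ratio in either direction, which is why the bias is hard to anticipate.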

The alternative approach is to select community controls, choosing subjects from the non-hospitalized population. It is, however, not straightforward to select a representative control group from the general population, especially if, for example, a certain age and sex distribution is required.

There is also likely to be less willingness among healthy people to participate in a study than among hospital patients, which would introduce a further bias. Some studies use both hospital controls and community controls, which is a desirable approach when there is doubt about the validity of hospital controls.

One way to make the cases and controls more comparable is to match for some variables that might confuse the comparison. Matching means that each case is individually paired with a control subject. For example, for each case we might seek a control subject of the same age, sex and occupation. Matching is only useful, however, for variables that are strongly related to both the exposure and the outcome of interest. Further, it is important to appreciate that any variable used for matching cannot be investigated as a possible risk factor for the outcome. Thus if we individually match post myocardial infarct (MI) patients (cases) with non-MI controls with respect to whether or not they are vegetarian, we cannot find an association between MI and meat-eating if there is one.

For rare events, the strength of the study can be increased by having more controls than cases. Where matching is used each case can have several matched controls. For example, Cuckle et al. (1986) compared the level of alpha-fetoprotein in stored serum from the umbilical cords of Down's syndrome babies and controls. For each Down's baby they took three controls matched for the baby's gestational age at delivery and duration of storage of the serum samples.

5.10.2 Selection of cases

The selection of controls is a major problem, but the selection of cases should also be considered carefully. While it may be reasonable to group together all diabetics, many diseases such as most cancers are heterogeneous in cause, nature and degree. The choice of cases with respect to type of disease and other factors such as age determines the degree of generalizability of results.

5.10.3 Recall bias

Another important source of bias is that due to differential recall by cases and controls. In many case-control studies retrospective information is obtained by interviewing the subjects. People with a particular disease or condition may have thought a lot about a possible link with their past behaviour, especially with respect to widely publicized risk factors. For example, women having a miscarriage may be more likely to report exposure to possible hazards, such as use of a video display terminal, than women whose pregnancies went to term. Such a study may thus reflect perception of risk rather than a true risk.

Although it may not always be present (Mackenzie and Lippman, 1989), there is enormous scope for recall bias in case-control studies. In general the bias is due to under-reporting of exposure in the control group. Usually there are no records against which to check reports, but efforts should be made to evaluate and minimize the effect of recall bias.

5.10.4 Inaccuracy of retrospective data

In addition to biased recall of events, there is the possibility of a general inaccuracy in recalled information. Studies requiring recall of detailed dietary or smoking habits are prone to this problem, as are those requiring a precise breakdown of subjects' working history to evaluate total exposure to a hazard.

While there may be no general tendency to over- or under-estimate exposure in the recalled information from a large number of subjects, the 'noise' introduced by errors in recall does have the effect of leading to an underestimate of the association between the exposure and the outcome of interest (Breslow and Day, 1987, p. 41). There is not usually much that can be done to improve the accuracy of long-term recall data.
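
The attenuation caused by random recall error can be illustrated with expected counts. The following is a minimal sketch, assuming an invented 2x2 table and an invented error rate of 20% that applies equally to cases and controls and in both directions (exposed reported as unexposed and vice versa). None of these numbers come from the studies cited.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table of exposure by case/control status."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def misreport(exposed, unexposed, error_rate):
    """Expected reported counts when each subject's exposure is misreported
    with probability error_rate, whichever group they truly belong to."""
    reported_exposed = exposed * (1 - error_rate) + unexposed * error_rate
    reported_unexposed = unexposed * (1 - error_rate) + exposed * error_rate
    return reported_exposed, reported_unexposed


# Hypothetical true exposure status.
true_cases = (600, 400)      # (exposed, unexposed)
true_controls = (300, 700)   # (exposed, unexposed)
error_rate = 0.20            # assumed, identical for cases and controls

reported_cases = misreport(*true_cases, error_rate)
reported_controls = misreport(*true_controls, error_rate)

true_or = odds_ratio(*true_cases, *true_controls)
observed_or = odds_ratio(*reported_cases, *reported_controls)

print(f"True odds ratio:     {true_or:.2f}")
print(f"Observed odds ratio: {observed_or:.2f}")
```

Because the error rate is the same in both groups, no group is systematically over- or under-reported, yet the observed odds ratio moves towards 1. An error rate that differed between cases and controls would correspond to recall bias rather than noise and could move the estimate in either direction.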

A related problem is that data obtained from hospital notes will suffer from incompleteness due to missing information and missing notes.

5.10.5 Ascertainment bias

Another form of bias can arise through a relation between the exposure and the probability of detecting the event of interest. For example, women taking the oral contraceptive pill will have more frequent cervical smears than women not on the pill, and as a consequence are more likely to have cervical cancer detected if it is present (and it is likely to be detected at an earlier stage). Thus in a case-control study comparing women with cervical cancer and a control group, an excess of pill taking among the cases may be (at least partly) due to the ascertainment bias (or detection bias) related to more frequent screening.

5.10.6 Comment

The problems discussed are only the most obvious difficulties associated with case-control studies. More detailed discussion can be found in epidemiology textbooks, such as Breslow and Day (1980) and Schlesselman (1982). Case-control studies can be very valuable, but much care is needed in their planning, analysis and interpretation. The considerable scope for bias is a strong reason for seeking expert epidemiological and statistical collaboration at the planning stage. It has been suggested that many contradictory results from case-control studies of the same topic are due to the lack of adherence to rigorous scientific principles in their design (Mayes et al., 1988).

However carefully sources of bias have been excluded, the observation in a case-control study of an association between an outcome and a risk factor must be interpreted with much care. Specifically, it is wrong to take such a finding as necessarily indicating a causal link. Observational studies cannot do more than suggest possible causal links; other research is needed to investigate these ideas more deeply. For example, Mattila et al. (1989) found an association between poor dental health and acute myocardial infarction. While the authors advanced a possible explanation for a causal link, the observed association might be because people with poor dental health tend to look after themselves poorly in general, for example with respect to their diet. Clearly it helps to collect information on possible confounding variables, which can be incorporated into the analysis.

5.11 THE COHORT STUDY

The prospective cohort study (or follow-up or longitudinal study) is the method of choice for an observational study, but there are certain difficulties with this design too. The essence of the cohort study is to identify a group of subjects of interest and then follow them up to see what happens. Because of the need to observe unaffected individuals until a fair proportion develop the outcome of interest, cohort studies can take a long time and may thus be very expensive. They are usually unsuitable for studying rare outcomes as it would be necessary to follow a huge number of subjects to get an adequate number of events.

There is usually one particular event of interest, such as death or recurrence of disease, but there may be several. There may be subgroups of subjects identified at the outset whose experience is to be compared, such as smokers and non-smokers or patients with different stages of breast cancer. Alternatively the purpose of the study may be to use the information gained to try to identify those subjects most at risk of developing the outcome of interest. For example, we could follow patients with cirrhosis of the liver, identify those developing carcinoma of the liver over, say, ten years, and compare their characteristics with those who do not get a carcinoma. Because the study is prospective the nature and quality of the data recording can be carefully controlled.

Breslow and Day (1987, pp. 15-20) summarize the advantages of cohort studies over case-control studies. There are some problems with cohort studies, however. Selection of the subjects to study is a common problem with all research, and is discussed below along with three problems specific to follow-up studies.

5.11.1 Selection of subjects

The selection of subjects to study is important in all research. In follow-up studies the probability of the event of interest occurring may be strongly related to how the sample was obtained. The issues are clearly seen in a review by Ellenberg and Nelson (1980) of published studies of the frequency of an adverse prognosis in children having a febrile seizure. They observed that such seizures occur in 2% to 4% of all young children, and as there may be harmful consequences of long-term anti-convulsant therapy it was important to quantify the risk of further seizures.

They reviewed 23 studies in which the risk of subsequent nonfebrile seizures had been ascertained. In 17 studies the children had been identified in special clinics or hospital emergency rooms. The other six had taken population samples, in which the investigators attempted to identify and follow up all children in a defined population who experienced a febrile seizure in a certain time period. It is likely that the prevalence of febrile seizures varies from one area to another, and we would expect some effect of different protocols in the different studies. Nevertheless we would expect different population-based studies to give similar results. In contrast, the clinic-based studies will inevitably be biased towards higher risk children because they will only see the more serious cases. The extent of the bias will be variable according to local referring patterns and alternative facilities. We would thus expect the clinic-based studies to show higher and more variable recurrence rates than the population-based studies, and this is exactly what Ellenberg and Nelson found. The seven population-based studies obtained recurrence rates of from 1.5% to 4.6% (median 3.0%), whereas the 17 clinic-based studies found rates between 2.6% and 76.9% (median 16.9%). These large estimated recurrence rates had led to many children being treated prophylactically; the much smaller rates obtained in the population-based studies argued against such treatment.

Similar differences in outcome in relation to sample selection would be likely in follow-up studies of other medical conditions. In some cases, however, studying attenders at special clinics may give an optimistic picture. Examples are cystic fibrosis in newborn babies and myocardial infarction, for both of which some cases will not live long enough to be able to attend a clinic. Population samples are difficult and expensive to carry out, but studies of highly selected subjects may well give misleading results, especially regarding the natural history of disease.

5.11.2 Loss to follow-up

The main difficulty specifically encountered in cohort studies is that some subjects will not be followed up for the full length of the study. They may move to another area or lose interest, or they may even die. The longer the study, the more subjects will be lost. Losses to follow-up reduce the numbers supplying information, and thus weaken the analysis slightly. The main worry, however, is that subjects are lost to follow-up for some reason that is related to the outcomes being studied or to pre-defined risk categories. There is a considerable risk of this type of bias, and so strenuous efforts are needed to try to contact as many people as possible. Some losses are inevitable, and it is useful to compare the characteristics of these subjects on entry to the study with those with whom contact is maintained.

Even with a short follow-up period there will be losses for various reasons, some of which might be related to the aim of the research. Martin and Bracken (1987) identified 6219 pregnant women in New Haven for possible inclusion in a study to investigate the relation between maternal caffeine consumption and birth weight. Of these, 5331 women agreed to be contacted, and 4926 were eligible for the study. The number yielding data for the main analysis was reduced to 3858, with the following reasons for exclusion:

4926 eligible and willing to be in study

473 refused to be interviewed

263 could not be reached

4 unreliable interviews

4186 valid interviews obtained

76 pregnancy outcome not ascertained

56 delivered at a different hospital

116 not a live birth

46 not singleton deliveries

33 birth weight not recorded

3858 caffeine consumption and birth weight obtained
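
A participant flow like this is easy to check by tallying the exclusions at each stage. The sketch below uses only the counts listed above; the variable names are illustrative. As transcribed here, the second group of exclusions sums to 327, which would leave 3859 rather than the reported 3858, so the sketch reports that gap of one instead of guessing which figure is off.

```python
eligible = 4926

# Losses before a usable interview was obtained.
interview_losses = {
    "refused to be interviewed": 473,
    "could not be reached": 263,
    "unreliable interviews": 4,
}

# Losses after a valid interview, before the main analysis.
outcome_losses = {
    "pregnancy outcome not ascertained": 76,
    "delivered at a different hospital": 56,
    "not a live birth": 116,
    "not singleton deliveries": 46,
    "birth weight not recorded": 33,
}

reported_valid_interviews = 4186
reported_final = 3858

valid_interviews = eligible - sum(interview_losses.values())
remaining = valid_interviews - sum(outcome_losses.values())
discrepancy = remaining - reported_final

# Fraction of eligible women who contributed to the main analysis.
retained_fraction = reported_final / eligible

print(f"Valid interviews (computed / reported): {valid_interviews} / {reported_valid_interviews}")
print(f"Final sample (computed / reported):     {remaining} / {reported_final}")
print(f"Unexplained difference:                 {discrepancy}")
print(f"Retained among eligible women:          {retained_fraction:.1%}")
```

Running such a tally while a study is in progress makes it obvious when a reason for exclusion has gone unrecorded.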

This study illustrates the wide range of reasons for incomplete follow-up. It may not seem likely that any of these reasons for loss to follow-up would have been related to either caffeine consumption or birth weight to an important degree, but the possibility of bias should always be considered.

In studies carried out over many years large numbers of subjects may be lost, especially in highly mobile populations, severely weakening the reliability of the results. Non-response to postal questionnaires is particularly common. If the outcome of interest is death, however, national registers can provide information about subjects who have not maintained contact. Similarly, in some countries disease registers allow virtually complete follow-up. For example, in a study of all Swedish conscripts in 1969-70, registers were used to identify both admissions for psychiatric care and deaths (Andréasson et al., 1987).

5.11.3 Other problems

Long-term studies may suffer from problems associated with change in habits. For example, people may change jobs (and hence exposure to risk) or become unemployed, or may change the consumption of cigarettes, alcohol or specific items of food. It is, though, a strength of the cohort study that repeated assessments of risk status can be made.

Perhaps a more serious problem is that different groups may not be investigated equally closely. In particular a high risk group may be studied more carefully, resulting in advantageous earlier detection of medical problems. Conversely, intensive investigation of the high risk group may lead to the greater discovery of conditions that are actually equally common in the low risk group. Surveillance bias is eliminated when all subjects are investigated identically, preferably with the assessors being unaware of each person's risk status.

5.12 THE CROSS-SECTIONAL STUDY

In a cohort study subjects with different characteristics are identified and followed to see what happens. By contrast, in a cross-sectional study all the information is collected at the same time because subjects are only contacted once. Many cross-sectional studies are descriptive, and these are often called surveys. For example, we might ask undergraduates about their alcohol consumption, carry out a survey of the use of alternative medicine in a particular area, or investigate the ability of a particular blood test to give a correct 'diagnosis' in inpatients with certain symptoms.

Some cross-sectional studies are, however, carried out to investigate associations between a disease and possible risk factors, so that this design is an alternative to the case-control and cohort approaches. The cross-sectional study does not suffer from many of the difficulties that affect these other designs, such as recall bias and loss to follow-up. It is relatively cheap and easy to carry out. Needless to say, there are different special problems associated with cross-sectional studies.

5.12.1 Sample selection

Cross-sectional studies share the problems of sample selection with cohort studies. Although research is carried out on a limited number of individuals, the interpretation of results is usually extended widely. A survey of GP referral practices or health education in one county will probably be taken as an indication of what happens nationally. However, the nature of hospital inpatients, clinic attenders, general practice attenders and those not attending anywhere may vary enormously. Apart from affecting the observed prevalence of a disorder, the choice of sample may have a strong effect on the observed relation with other factors. Clearly, the validity of the extrapolation depends crucially on the representativeness of the sample. It is an inherent weakness of most observational studies that the sample is not representative of the population. In some cases, however, we can select a random sample for a survey, which is the ideal method.

5.12.2 Response rates

Many cross-sectional studies obtain all or most of their information from postal questionnaires. Non-response can be a big problem, with perhaps only 50% to 80% of questionnaires being returned. Many studies have found that there are marked differences (demographic and health-related) between those who do or do not respond to a questionnaire, with the non-responders usually being less healthy. This is sometimes known as volunteer bias. If some information is available for non-responders (perhaps basic demographic details) it is valuable to assess whether there are any apparent differences between responders and non-responders. Similar age and sex distributions will not, however, necessarily indicate a lack of bias.

For example, in a health status survey of elderly people the response rate was age related, being highest in those aged 85 and over (84%) and lowest in those aged 65 to 74 (74%) (Rockwood et al., 1989). However, non-responders were found to spend more time in hospital than responders, and this difference was most marked in the oldest group.
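
The size of the bias from non-response depends on both the response rate and how different the non-responders are. The short calculation below is a sketch with invented figures (a 70% response rate and assumed prevalences of poor health of 20% among responders and 35% among non-responders); it is not based on the survey cited above.

```python
# All three inputs are assumptions chosen for illustration.
response_rate = 0.70
prevalence_responders = 0.20      # what the survey would report
prevalence_non_responders = 0.35  # normally unknown to the investigators

# Prevalence in the whole target sample is a weighted average of the two groups.
true_prevalence = (
    response_rate * prevalence_responders
    + (1 - response_rate) * prevalence_non_responders
)

# Error made by taking the responders' figure as the answer.
bias = prevalence_responders - true_prevalence

print(f"Estimate from responders only: {prevalence_responders:.1%}")
print(f"Prevalence in the full sample: {true_prevalence:.1%}")
print(f"Bias of the survey estimate:   {bias:+.1%}")
```

The bias equals the non-response fraction multiplied by the difference between the two groups, so it shrinks as the response rate rises, even if non-responders remain just as different.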

In any study strenuous efforts should be made to get as high a response rate as possible. For example, in studies collecting data by postal questionnaire it is common to have second and third mailings for those who do not respond to the first letter.

5.12.3 Cause or effect?

The particular difficulty associated with cross-sectional studies looking at associations with disease concerns the sequence in time of the disorder of interest and the possible risk factor. For example, if we were to carry out a study of the relation between employment status and health we would probably find that the unemployed have worse health than those in employment. We might conclude that being unemployed leads to poorer health, but an equally valid possibility is that poor health leads to being unemployed, or both statements might be true. Because we have collected both sets of information at the same time we cannot draw a clear inference of causality. Similar situations arise in many circumstances where either the disorder develops slowly or the exposure is long-term (or both). Some case-control studies suffer from the same weakness. A prospective study is the best way to investigate such questions.

5.13 STUDIES OF CHANGE OVER TIME

The last type of study design considered in this chapter is that in which two or more independent sets of cross-sectional data are used to make inferences about changes over time. Two situations where this design is used will illustrate many of the difficulties.

The first example is in the study of growth patterns when it is not possible to take many measurements from each individual. For example, ultrasound measurements of the fetus are now routine in many hospitals, and it is important to know the usual variability of the various measurements of fetal size such as head circumference. Many such studies have been performed. Apart from the usual problem of sample selection these studies often include variable numbers of measurements from different fetuses. Most pregnant women have just a single ultrasound scan at about 15-20 weeks of gestation. Repeat scans are usually performed only if there is some reason for clinical concern, such as apparently poor growth. Inclusion of such data will therefore bias the sample towards these fetuses, which will particularly affect data in the second half of pregnancy. A further problem is that data collected in this way are usually plotted and the line joining the means at each week of gestation is taken as the average 'growth curve'. The means do not, however, indicate average growth but average size; by definition we need measurements of each fetus on two or more occasions in order to study growth. We cannot make valid inferences about growth from single measurements of size; we cannot create a longitudinal study from cross-sectional data.

The same applies when we consider populations rather than individuals, and further problems arise when we are concerned with a possible causal relation. For example, the change in the death rate from motoring accidents has been compared in several countries for the periods before and after the introduction of seat-belt legislation. The inference of such studies is that any reduction in the death rate is due to the introduction of seat-belts, but there may have been other differences between the two time periods, such as a reduction in drinking and driving. The problem is seen more clearly when data for many time periods are examined. Data from 1950 to 1984 show a steady rise in the average daily prison population and a fall in the number of patients in psychiatric beds. This was interpreted as a causal link, with discharged long-stay psychiatric patients ending up in prison (Weller and Weller, 1986). However, any two quantities changing over time will show a statistical association, such as the price of beer and the salaries of priests (Gibbons and Davis, 1984) or the proportion of unmarried mothers and the rate of Caesarean section. Association is not necessarily causation; very careful statistical analysis of such data is required.
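
The point that two trending quantities will be correlated whatever their relation can be shown with a small constructed example. In the sketch below both series are invented: one rises steadily and one falls steadily, each with a small fixed wiggle chosen so that their year-on-year changes are unrelated. Correlating the year-on-year changes rather than the levels is one common check, used here as an assumption of the sketch rather than a method taken from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def first_differences(values):
    """Year-on-year changes: values[t] - values[t - 1]."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


years = range(37)
wiggle_a = [1, -1]          # hypothetical fluctuation repeating every 2 years
wiggle_b = [1, 1, -1, -1]   # hypothetical fluctuation repeating every 4 years

# Two made-up series: one with an upward trend, one with a downward trend.
series_a = [100 + 3 * t + wiggle_a[t % 2] for t in years]
series_b = [200 - 2 * t + wiggle_b[t % 4] for t in years]

r_levels = pearson(series_a, series_b)
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"Correlation of the yearly levels:  {r_levels:.3f}")
print(f"Correlation of the yearly changes: {r_changes:.3f}")
```

The levels are almost perfectly negatively correlated purely because of the opposing trends, while the changes from one year to the next show no association. A strong correlation between trending series is therefore weak evidence of any link between them.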

5.14 CHOOSING A STUDY DESIGN

The choice between an experiment and an observational study is usually straightforward. If it is possible, both ethically and logistically, to carry out an experiment, then this is the preferred approach. In particular, the evaluation of alternative treatments is best addressed by a randomized controlled trial (see Chapter 15). Most studies are not experiments. A review of papers published in the New England Journal of Medicine in 1978-79 found that only 90 of 332 original articles were controlled experiments (Bailar et al., 1984), and the proportion is probably unusually high in that journal. The majority of the remainder were observational studies, and most of those were cross-sectional studies. The previous sections have discussed the advantages and (especially) the disadvantages of case-control, cohort and cross-sectional studies. All have their weak points, although the prospective cohort study is usually the best bet if feasible.

The large number of possible biases in observational studies can lead to considerable variation in the findings from similar studies of the same phenomenon. This is seen in the regular series of scares about an increased risk of cancer associated with high consumption of coffee, beer, tea, sweeteners, and so on. Feinstein (1988) argued that much of the confusion can be attributed to the failure to develop adequate scientific standards for observational epidemiological studies. Lichtenstein et al. (1987) gave guidelines for reading reports of case-control studies.

The choice of the most appropriate design is not easy, as there are many considerations to weigh up. The involvement of a statistician at the planning stage is strongly recommended. As well as advising on the choice of design, they can give valuable assistance regarding the selection of suitable samples of individuals for study, a problem that must be confronted with any study design but is especially important in observational studies. The statistician can (and should) also advise on the appropriate sample size. Chapter 15 describes sample size calculations for clinical trials; similar methods are available for observational studies.

A recurring theme in this and later chapters is the considerable gulf between an observed association and inference of a causal mechanism. Only in randomized trials and other experiments can we reasonably regard an observed effect as causal, because of the controlled nature of the investigation. (But Chapter 15 describes some of the possible problems that can arise in clinical trials.) When planning an observational study it is important to bear in mind the information that will be obtained, and how easily the results will be able to be interpreted. In observational studies the interpretation of observed associations needs great care. For example, the study of Swedish conscripts found a strong association between cannabis consumption and subsequent schizophrenia (Andréasson et al., 1987). The authors of the report were very careful, however, to consider whether the relation was causal or not. In particular they considered, but cautiously rejected, the possibility that cannabis consumption might be caused by emerging schizophrenia. The study of longevity and left-handedness referred to in section 1.1 is a contrasting example (Halpern and Coren, 1988). Although the authors acknowledged that the observed small reduction in longevity of left-handers is not necessarily causal, they did not consider the possibility of bias as an explanation. Their finding could well be explained by having analysed age at death, ignoring those still alive, for baseball players born over a long period during which the prevalence of left-handedness would have risen through changes in social attitudes. Those left-handers who died would thus be expected to have died younger than right-handers who had died. (The correct way to analyse this type of data is described in Chapter 13.)

This chapter has introduced various issues in the design of research, but is by no means comprehensive. Lengthier discussions can be found in Gehlbach (1982) for design issues in general, Pocock (1983) for clinical trials, Breslow and Day (1980) or Schlesselman (1982) for case-control studies, and Breslow and Day (1987) for cohort studies.

EXERCISES

5.1 In 1978-79 a random sample of 1007 residents (608 men and 399 women) of the Lothian region (around Edinburgh) had been asked precisely what alcohol they had drunk in the previous seven days. In March 1981 the combination of an increase in taxation and brewers' prices meant that, for the first time in over 30 years, the price of alcoholic beverages increased faster than the retail price index. So in the autumn of 1981 the 676 respondents (484 men and 192 women) who had had at least one alcoholic drink in the seven days on which the original survey had been based - the so-called 'regular drinkers' - were reinterviewed.

The first survey was carried out between July 1978 and February 1979 and the second between September 1981 and March 1982. Over the three years, the cost of alcoholic beverages had risen by 61% while the retail price index had risen by 52%. Average earnings (and disposable income) had risen more than the retail price index, suggesting that those in regular employment were marginally better off than in 1978. Unemployment in the Edinburgh area, however, had risen steeply between 1978 and 1982 for both men and women.

The results of the second survey were reported as follows:

'Of the original 676 regular drinkers, 463 were successfully interviewed. Of the 213 who were not, 85 could not be traced, 48 were known to have left the region, 39 refused, and 23 were either dead or too ill to be interviewed. A disproportionate number of lost respondents were under the age of 30, unmarried, and not in regular employment. Nevertheless, the sex ratio and both male and female alcohol consumption at the time of the first survey of the 463 who were reinterviewed were representative of the original sample.' (Kendell et al., 1983).

(a) The authors were interested in reduction in alcohol intake, and so did not interview those subjects not reporting drinking in the first survey. Is this reasonable?

(b) What was the response rate to the second survey? How might non-respondents differ from respondents? What is the likely effect on the interpretation of the results of the survey?

(c) Does it matter that the two surveys were not carried out at exactly the same time of year?

(d) If the data showed a reduction in alcohol consumption among the 463 reinterviewed subjects, could the authors reasonably conclude that it was due to the rise in excise duty on alcohol?

The Discussion of the paper begins:

'The central finding of this before and after survey is that a representative population of 463 regular drinkers in the Lothian region reduced their alcohol consumption by 18% between 1978-9 and 1981-2 and simultaneously experienced a 16% reduction in adverse effects. The main cause of this fall in consumption was probably the rising cost of alcoholic beverage relative to the cost of living and average incomes during that three year period.'

(e) Were the 463 'regular drinkers' really a 'representative population'?

(f) Comment on the authors' interpretation of the results. Would your opinion be different if they had interviewed all 1007 subjects in the second survey?

In the final paragraph the authors wrote:

'The findings of this study indicate, therefore, that an increase in excise duty on alcoholic beverages can be an effective means of reducing the ill effects of excessive alcohol consumption.'

(g) Do these conclusions have any validity?

5.2 A researcher wished to see if women who have taken the oral contraceptive pill have an earlier or later menopause than other women. He decided to study a group of women born in 1930 as these would be young enough for some to have taken the pill but old enough for some to have reached the menopause. He obtained the names of all 132 women in one general practice who were born in 1930, using the practice's age-sex register. Women claiming to have had the menopause were checked by measuring their follicle stimulating hormone (FSH) levels.

Of the 132 women, 101 were excluded from the study for the following reasons:

- 22 not available (21 not contactable, 1 refusal)
- 60 premenopausal
- 14 hysterectomy
- 1 radium-induced menopause
- 2 unmarried
- 2

(a) What was the design of this study?

(b) Is the sample of 31 women representative of the population of interest?

The researcher found that 12 of the 31 women had taken the oral contraceptive pill at some time, while 19 had not. He obtained the following results relating to age at menopause in the two groups, and concluded that taking the pill does not delay the menopause:

| | n | Age at menopause (years): mean | Age at menopause (years): SD |
|---|---|---|---|
| Pill users | 12 | 47.2 | 2.1 |
| Non pill users | 19 | 47.5 | 2.1 |

(c) What was the fundamental error in the design of this study?

(d) What design is needed to answer the question originally posed?

(This exercise is based on a frank account of a flawed research project by Davis, 1985.)

5.3 Halpern and Coren (1988) wished to see if there was a difference in longevity between left-handed and right-handed people. One of the few sources of handedness of individuals is a baseball encyclopaedia. From an encyclopaedia they recorded the dates of birth and death of 1472 right-handed and 236 left-handed players.

(a) Was this a representative sample of the population?

(b) The authors did not state the time span of the data, but as they note deaths up to age 99 it is likely to cover the whole of the twentieth century. How might the long time span bias the comparison of left- and right-handers?

(c) They compared the mean age at death in the two groups. Why is this a misleading comparison?

(d) What would be a better design to answer this question, assuming that handedness data were more widely available?

6 Using a computer

The good news is that statistical analysis is becoming easier and cheaper. The bad news is that statistical analysis is becoming easier and cheaper.

Hofacker (1983)

6.1 INTRODUCTION

Recent technological advances have provided many medical researchers with access to a computer. This change has largely been beneficial, but Hofacker's words above should be borne in mind. Computers remove most of the tedious aspects of statistical analysis, and should give us correct answers, but they do not guarantee that we will obtain correct and valid results. In this chapter I shall consider the advantages and disadvantages of using computers for statistical analysis, and suggest ways to approach the analysis of data. I shall also consider the design of forms for collecting data to be analysed by computer.

6.2 ADVANTAGES OF USING A COMPUTER

There are many advantages in using a computer to carry out statistical analyses. Most obviously it enables us to do things we couldn't otherwise do, but there are many other benefits:

(a) Accuracy and speed

Good computer programs (known as software) will give the correct answers quickly. Analysis by hand is prone to arithmetical errors, and is painfully slow for all but the simplest tasks.

(b) Versatility

A computer gives access to a wide range of statistical techniques, many more than are described in this book. Even complex analyses can be performed quickly.

(c) Graphics

Computer programs enable plots of observations or statistical results to be obtained easily. Full advantage should be taken of this facility. Histograms and scatter diagrams can be used to inspect the raw data (see Chapter 7), and plots can also be used to study the results from an analysis. Section 6.8 discusses some practical issues relating to computer plots.

(d) Flexibility

A major advantage is the ability to make small changes and repeat the analysis. For example, it is simple to rerun an analysis after transforming the data, perhaps by taking logs (see Chapter 7), to perform the same analysis on a subset of the data, or to add some new observations.

(e) New variables

It is simple to generate new variables. We may calculate a subject's age from their date of birth and the date of the study, or the change in a measurement by taking the difference between pre- and post-treatment values, or count the number of symptoms a patient has. Such calculations should always be done on the computer, which is faster and more accurate than doing the calculations by hand. Of course, if the instruction to create a new variable is incorrect or is typed wrongly all of the observations will be wrong.
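As a minimal sketch of the three derived variables just mentioned, the following Python fragment computes age, change from baseline and a symptom count for one subject. The record layout, the variable names and all the values are invented for illustration; they are not taken from any study in this book.

```python
from datetime import date


def age_in_years(born: date, on: date) -> int:
    """Completed years between the date of birth and the study date."""
    years = on.year - born.year
    # Subtract one if the birthday has not yet occurred in the study year.
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years


# Hypothetical record for one subject (all values invented).
subject = {
    "date_of_birth": date(1930, 6, 15),
    "study_date": date(1985, 3, 1),
    "pre_treatment": 142.0,
    "post_treatment": 131.5,
    "symptoms": {"cough": True, "fever": False, "fatigue": True},
}

derived = {
    "age": age_in_years(subject["date_of_birth"], subject["study_date"]),
    "change": subject["post_treatment"] - subject["pre_treatment"],
    "n_symptoms": sum(subject["symptoms"].values()),
}
print(derived)
```

Because the rule is written once and applied to every subject, a mistake in it affects every observation, which is why the derived values for a few subjects are worth checking by hand.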

(f) Volume of data

Vast amounts of data can be handled. Indeed, for some programs there is no limit to the number of subjects (cases) that can be analysed.

(g) Easy transfer of data

Once data have been entered into a computer file they can easily be transferred between researchers either electronically (by telephone line) or by sending a 'floppy disk' by post. It should never be necessary to enter the same data into a computer twice, but unfortunately computers use a variety of disk formats and sizes.

6.3 DISADVANTAGES OF USING A COMPUTER

To counterbalance the major benefits there are several potential problems that users of statistical software should be aware of.

(a) Errors in software

Not all statistical programs are well-written. Some may give incorrect answers in certain circumstances, either through poor programming or inadequate understanding of the statistical theory. It is advisable to use programs that are reputable and have been around long enough for errors to be found, the best known of which are BMDP, Minitab, SAS and SPSS. Since the advent of microcomputers (PCs) there has been a huge increase in the number of statistical programs on the market, some of which are poor and some incorrect (Bland and Altman, 1988; Dallal, 1988). Sections 6.4 and 6.5 give advice about choosing and evaluating statistical software.

(b) Versatility

Versatility was given as one of the advantages of using a computer, but it can lead to difficulties too. Because of the wide variety of analyses available, it is easy to use an inappropriate method. It is essential to be aware of the limits of your statistical knowledge, and to use only methods that you understand. If your problem seems to require methods you are not familiar with you should seek expert advice.

(c) The black box approach

Using a computer may distance you from your data. It is possible to perform statistical analyses automatically: the data go in at one end and the results come out at the other, untouched by human thought. Because much statistical analysis is concerned with average effects you may get no feel for the way individuals respond.

(d) Garbage in garbage out

'Garbage in garbage out' refers to the fact that sensible answers follow only from sensible questions. If the data input or the specification of the analysis was wrong then the results will be wrong. For example, a common problem is what to do about missing observations. When data are entered into the computer such values are sometimes left blank, in which case the value will automatically be taken as zero, or they are given a numerical 'missing value code', such as 99. It is common to use values like 9, 99, 999, etc. as missing values, or perhaps a negative number - any value will do as long as it clearly could not be a genuine observation.

Table 3.1 showed the PImax values of 25 patients with cystic fibrosis; the mean and standard deviation were 92.6 and 24.92 cm H₂O respectively. Suppose that there had been five other patients in the study whose PImax was unknown. If their values were left blank (zero) or coded 999 and all 30 values analysed by a computer program then the results would have been as follows:

| Value for missing data | Mean (30 subjects) | SD (30 subjects) |
|---|---|---|
| 0 | 77.2 | 41.79 |
| 999 | 243.7 | 344.32 |

both of which are major distortions of the truth. The computer will accept the values 0 or 999 as genuine observations, and so will give false answers. Missing data must be identified as such to the program (see section 6.6).
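The distortion can be checked without the raw data. The sketch below, which is only an illustration and not the method used to produce the figures above, rebuilds the sum and sum of squares from the quoted summary statistics (n = 25, mean 92.6, SD 24.92) and then adds five values all equal to the missing-value code. The function name is hypothetical.

```python
import math


def add_constant_values(n, mean, sd, k, value):
    """Mean and SD after adding k observations, all equal to `value`,
    to a sample summarised by n, mean and sd (sd uses divisor n - 1)."""
    total = n * mean + k * value
    sum_sq = (n - 1) * sd**2 + n * mean**2 + k * value**2
    m = n + k
    new_mean = total / m
    new_var = (sum_sq - m * new_mean**2) / (m - 1)
    return new_mean, math.sqrt(new_var)


for code in (0, 999):
    mean, sd = add_constant_values(25, 92.6, 24.92, 5, code)
    print(f"missing coded as {code}: mean = {mean:.1f}, SD = {sd:.2f}")
```

The results agree with the values quoted above to within the rounding of the published mean and standard deviation.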

A similar problem may arise when information is not appropriate for some individuals rather than actually missing. For example, the number of pregnancies is only appropriate for women, and may be recorded as 9 or 99 for all males in a study. These examples show the importance of checking the data before analysis, as discussed in section 6.6 and in the next chapter.
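One way to identify such codes to a program is to declare them once and convert them before any calculation. The following is a sketch under stated assumptions: the codes (999 for a missing PImax, 99 for 'not applicable'), the variable names and the four records are all invented, and real packages have their own ways of declaring missing values.

```python
import statistics

# Hypothetical codes: 999 = value missing, 99 = not applicable.
MISSING_CODES = {"pimax": {999}, "pregnancies": {99}}


def recode(records):
    """Replace the declared missing-value codes with None."""
    cleaned = []
    for rec in records:
        cleaned.append({
            name: (None if value in MISSING_CODES.get(name, set()) else value)
            for name, value in rec.items()
        })
    return cleaned


def mean_of(records, name):
    """Mean of the non-missing values of one variable, and how many were used."""
    values = [r[name] for r in records if r[name] is not None]
    return statistics.mean(values), len(values)


records = [  # invented data
    {"sex": "F", "pimax": 80, "pregnancies": 2},
    {"sex": "M", "pimax": 999, "pregnancies": 99},
    {"sex": "F", "pimax": 100, "pregnancies": 0},
    {"sex": "M", "pimax": 120, "pregnancies": 99},
]
cleaned = recode(records)
print(mean_of(cleaned, "pimax"))        # based on 3 of the 4 subjects
print(mean_of(cleaned, "pregnancies"))  # based on the 2 women only
```

Reporting the number of values used alongside each result makes it obvious when a code has slipped through as a genuine observation.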

6.4 TYPES OF STATISTICAL PROGRAM

Commercially available statistical software is generally capable of performing many types of statistical analysis. Statistical programs, often called packages, vary in their capability and the way in which they work. Some of the more important aspects to consider are:

  1. statistical methods available
  2. accuracy
  3. maximum amount of data that can be analysed
  4. facilities for data manipulation (including editing)
  5. ability to accept missing data
  6. ease of use (is it 'user-friendly'?)
  7. maturity (is it tried and tested?)
  8. speed
  9. documentation
  10. error handling
  11. graphics capability
  12. quality of output
  13. cost.

The most important considerations are the first two in the above list, because you obviously need a package that will perform the analyses desired and achieve correct results. However, assessing accuracy is not easy. Other key issues are the ability to create plots simply, and helpful error messages when you make a mistake, as you often will. In addition, there are different ways of telling the program what you want done. In some packages one enters commands such as

plot height age

but in others one chooses from a menu of options. This is known as an interactive system. For programs that use commands it is usually possible to create a file of commands which can then be executed as a block. This has the advantage that possibly complicated instructions only have to be typed once, and that it is easy to edit the file to produce slightly different analyses.

As well as statistical packages, which cover a wide range of analyses, there are also some specialized programs for particular purposes, such as calculating sample sizes or confidence intervals. These are subject to some of the above requirements too, but should be judged mainly on their specific ability to do things that cannot be done in the usual packages.

It is worth seeking advice from a colleague or from a statistician before choosing a package to use, or buy. I strongly recommend that you use the same package for all your analyses, as it takes a considerable effort to become fully acquainted with even one package. So it is important to choose your software carefully. Few, if any, packages will perform all the analyses in this book, so that it is necessary to know all the types of analysis you might wish to do, which is not at all easy. There are many microcomputer statistics programs on the market (or even free) that can give incorrect results (Dallal, 1988), so if you have doubts about a particular statistical program it is advisable to compare its output with that from another.

The next section discusses some aspects of evaluating statistical software. If you know that you have reliable software then you can go on to section 6.6, which describes a general strategy for analysis.

6.5 EVALUATING A STATISTICAL PACKAGE

(This section can be omitted without loss of continuity.)

The main concerns when evaluating a statistical computer program are:

  1. Does it perform all the desired functions?
  2. Is it easy to use?
  3. Does it give the correct answers?

Advice from colleagues or from a statistician can be of great assistance in answering the first two questions, because it takes some familiarity with a package before one can really judge its value and ease of use. The list of features given in section 6.4 can aid evaluation. The purpose of this section is to give limited assistance in relation to (3) above.

A computer program may give the wrong answers either because it uses an incorrect formula or because it is not well written. The former is unlikely but possible. More often problems occur because of the way in which the program was written. The procedure by which a computer program performs a given calculation is known as an algorithm. Some algorithms are inferior in that they lose accuracy in some circumstances. To take a simple example, it can be shown that the standard deviation of three numbers $a - b$, $a$ and $a + b$ is $b$, whatever values we give $a$ and $b$. I calculated the standard deviations of sets of three numbers where $a$ increases but $b$ is held at 0.1 using two pocket calculators. For each of the four sets of numbers

(a) 7.0 7.1 7.2

(b) 77.0 77.1 77.2

(c) 777.0 777.1 777.2

(d) 7777.0 7777.1 7777.2

both gave the correct answer of 0.1, but for the set

(e) 77777.0 77777.1 77777.2
one calculator gave the standard deviation as 0.0 while the other gave an error - it would not calculate the standard deviation. The reason for this problem is that in extreme circumstances the calculator loses accuracy because it cannot store the large numbers obtained when the data are squared. There are algorithms that avoid this problem, and while we may not expect them to be used on a pocket calculator we would certainly expect a computer program to give the correct answers for such data. However, many microcomputer packages use the inferior algorithm (Dallal, 1988). There is also a risk of losing numerical accuracy in some complex analyses; problems with regression analysis are discussed in Chapter 12.
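The difference between the two kinds of algorithm can be sketched in Python. The function names are hypothetical, and the two formulas are the usual textbook ones rather than anything known about the calculators or packages mentioned above. Python carries many more digits than a pocket calculator, so the shortcut formula still copes with set (e); the sketch therefore also tries values near 1 000 000 000, an assumption chosen to expose the same loss of accuracy.

```python
import math


def sd_one_pass(xs):
    """Shortcut formula based on the sum of squares; can lose accuracy."""
    n = len(xs)
    sum_x = sum(xs)
    sum_sq = sum(x * x for x in xs)
    var = (sum_sq - sum_x * sum_x / n) / (n - 1)
    # Rounding error can even make the 'variance' negative.
    return math.sqrt(var) if var >= 0 else float("nan")


def sd_two_pass(xs):
    """Subtract the mean first, then square the (small) deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


for xs in ([7.0, 7.1, 7.2],
           [77777.0, 77777.1, 77777.2],
           [1e9, 1e9 + 0.1, 1e9 + 0.2]):
    print(xs, sd_one_pass(xs), sd_two_pass(xs))
```

For the last set the squares are so large that the differences of 0.1 are lost before the subtraction takes place, so the one-pass result is no longer close to 0.1, while the two-pass result remains close to it.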

Increasing use is being made of spreadsheet software for performing simple statistical analyses. These programs are not well suited to statistics, and may use inferior or incorrect methods. I do not recommend them for serious statistics.

Another aspect to consider for some types of analysis is which form of a test is used when there are different forms available. Subsequent chapters will discuss matters such as one- and two-sided tests, the use of continuity corrections, and the adjustment for ties in analyses of ranks (non-parametric methods). It is important to know precisely what method the program uses, and this is not always clear from the manual. Indeed some manuals are positively misleading (Bland and Altman, 1988).

When using a package to perform a particular analysis for the first time it is advisable to begin by analysing some sets of data for which you already know the answers. In this book the raw data are given for the worked examples wherever possible to enable you to do this.

6.6 STRATEGY FOR COMPUTER-AIDED ANALYSIS

This section contains a broad strategy for analysing data on a computer. Notice that there are several steps to pass through before moving to the analysis of the data.

(a) Data collection

Section 6.7 describes several aspects of preparing a coding sheet for data that are going to be typed into a computer. Data entry will be much quicker and more accurate if there is a well-designed coding form to work from.

(b) Data entry

Data should be typed into a file on the computer. This may be possible within the statistics package or using a general purpose editing program. The reason for storing the data is that you will often need to carry out further analyses at a later date, and you only want to enter the data once. Also it is easy to list the data and check that the values have been entered correctly. I consider formats for data files in section 6.7. A statistical package that cannot read data from a file should be rejected.

(c) Data checking

There is a tendency to believe that once the data are on the computer they must be correct. In fact it is all too easy to make errors when entering (typing) data, however careful one is. It is essential to check that the data have been typed correctly, however tedious this may be. The best way to minimize errors is to have the data entered twice, preferably by two different people. Here it is useful to have a program for comparing files. Any differences between the two files are checked and the correct value obtained. Data checking is discussed in section 7.2.
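A program for comparing two entries of the same data need not be elaborate. The sketch below assumes, purely for illustration, that each entry has already been read into a list of rows with the same fields in the same order; the function name and the example values are invented.

```python
def compare_entries(first, second):
    """List (row, column, value_1, value_2) for every disagreement."""
    if len(first) != len(second):
        raise ValueError("the two files have different numbers of rows")
    differences = []
    for row, (rec1, rec2) in enumerate(zip(first, second), start=1):
        if len(rec1) != len(rec2):
            raise ValueError(f"row {row}: different numbers of fields")
        for col, (v1, v2) in enumerate(zip(rec1, rec2), start=1):
            if v1 != v2:
                differences.append((row, col, v1, v2))
    return differences


# Invented example: the same three forms typed by two different people.
entry_1 = [["001", "47.2", "1"], ["002", "51.0", "2"], ["003", "49.5", "1"]]
entry_2 = [["001", "47.2", "1"], ["002", "15.0", "2"], ["003", "49.5", "7"]]

for row, col, v1, v2 in compare_entries(entry_1, entry_2):
    print(f"row {row}, column {col}: {v1!r} vs {v2!r} - check the original form")
```

The comparison only shows where the two entries disagree; the correct value still has to be found from the original form.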

(d) Data screening

Before starting the main statistical analysis it is important to look at the data. It is a simple task to produce a histogram of each variable, and pairs of variables can be inspected by scatter diagrams. These plots will give a first idea of the average value, the variability, the shape of the distribution, and whether there are any outlying or missing values. Data screening is discussed in section 7.5.
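Even without graphics software a crude screen is possible. The following sketch counts missing values, lists values outside a plausible range and prints a histogram made of asterisks. The plausible range, the class width and the ages are all invented for illustration, and a real statistical package would normally produce a proper histogram instead.

```python
from collections import Counter


def screen(values, low, high, width):
    """Report missing and out-of-range values and class counts of the rest."""
    present = [v for v in values if v is not None]
    outside = [v for v in present if not low <= v <= high]
    # Each class is labelled by its lower limit.
    bins = Counter(int(v // width) * width for v in present)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "outside_range": outside,
        "histogram": dict(sorted(bins.items())),
    }


ages = [34, 41, 29, None, 38, 45, 430, 36, 52, None, 31]  # invented data
report = screen(ages, low=16, high=100, width=10)
for lower, count in report["histogram"].items():
    print(f"{lower:>4} {'*' * count}")
print(report["missing"], "missing; outside range:", report["outside_range"])
```

Here the value 430 stands out immediately, both in the list of out-of-range values and as an isolated class far from the rest.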

(e) 数据分析 (e) Data analysis

适当的统计分析形式通常直接来源于研究设计。特别是,变量的数值可能在组间或组内进行比较,如第5章所述。组内比较必须使用针对该类型数据设计的技术。
The appropriate form of statistical analysis will often follow directly from the design. In particular, values of a variable may be compared between groups or within a group, as discussed in Chapter 5. Within group comparisons must make use of techniques intended for that type of data.

研究目标应指明几个主要感兴趣的分析。虽然预先指定的分析最为重要,但对数据的观察可能提示一些额外的有趣分析。这些“探索性”分析的结果应谨慎解读(见第8章)。
The objectives of the study should indicate a few main analyses of interest. Although the pre-specified analyses are the most important ones, inspection of the data may suggest some additional analyses of interest. The results of these 'exploratory' analyses should be interpreted cautiously (see Chapter 8).

许多统计方法基于对数据的某些假设。这些假设可能需要通过进一步分析来验证。
Many statistical methods are based on certain assumptions about the data. These may require further analyses to verify them.

我强烈建议如果软件支持,保持一个计算会话的“日志”,其中同时显示输入命令和结果。尤其当命令未保存到文件时,这一点尤为重要。
I strongly recommend that you keep a 'log' of the computing session if the software has the facility, in which both the input commands and results are shown. This is especially important when the commands are not stored on a file.

(f) 结果检查 (f) Checking results

你应检查结果是否对应正确的观测数量,无意中丢失或多出几个病例是很容易的。重要的是要认识到,计算机给出的结果不应被自动视为正确。对数据进行简单的初步检查应能让你对预期结果有所了解。如果结果与预期明显不同,则应检查数据是否有错误,以及是否进行了正确的分析。分析数据时容易出错,尤其是复杂分析。计算机只有在你提出正确问题时才会给出正确答案。显然,如前所述,有分析日志时检查结果会容易得多。
You should check that the results relate to the correct number of observations - it is surprisingly easy to lose or gain a few cases unwittingly. It is important to appreciate that the results obtained from a computer should not be taken as automatically correct. Simple preliminary inspection of the data should give you some idea of what results to expect. If the results obtained differ markedly from expectation, then you should check that there are no errors in the data, and that the proper analysis has been performed. It is easy to make mistakes when trying to analyse data, especially with complex analyses. The computer will give you the correct answer only if you ask the correct question. Clearly it is much easier to check results when there is a log of the analysis, as suggested above.

(g) 结果解释 (g) Interpretation

结果的解释将在后续章节讨论。
Interpretation of results is discussed in subsequent chapters.

6.7 数据收集表格 6.7 FORMS FOR DATA COLLECTION

当数据将用于计算机后续分析时,使用带有每位数字指定框的标准表格是个好主意。这适用于从现有记录(如医院病历)中提取数据的研究,也适用于前瞻性研究。尤其当每个个体收集许多变量信息时,这一点非常重要。
When data are to be collected for subsequent analysis using a computer, it is a good idea to use a standard form with assigned boxes for each digit. This applies to studies where data are to be extracted from existing records, such as hospital notes, as well as to prospective studies. It is especially important when information on many variables is collected for each individual.

我将先考虑计算机程序从文件接收(读取)数据的不同方式,然后讨论表格设计的相关方面。关于表格设计的进一步讨论见Pocock(1983,第160-166页)、De Pauw和Buyse(1984)(特别针对癌症试验)以及Armitage和Berry(1987,第8-14页)。
I shall first consider alternative ways in which computer programs can accept (read) data from a file, and then aspects of form design. Further discussion of form design is given by Pocock (1983, pp. 160-6), De Pauw and Buyse (1984) (with special reference to cancer trials) and Armitage and Berry (1987, pp. 8-14).

6.7.1 计算机程序输入格式 6.7.1 Formats for input to computer programs

大多数软件包能读取的一种标准格式(称为自由格式)如图6.1所示,该图展示了一项比较两种降压药的试验中部分数据。这里文件的每一行包含一个个体的多个变量,每个信息项之间用一个或多个空格分隔。列不必像示例中那样垂直对齐,但我建议这样做,因为这样便于目视检查所有信息是否正确录入。我强烈建议使用代码编号来标识每位受试者,如图6.1所示。这便于核查可疑值、后续添加变量、确认无重复受试者等。
A standard format (called free format) that most packages will read is shown in Figure 6.1, which illustrates the first part of the data from a trial comparing two antihypertensive drugs. Here each row of the file contains several variables for one individual and each item of information is separated from the next by one or more spaces. There is no necessity for the columns to line up vertically as in this example, but I recommend this practice as it makes it easy to check visually that all information has been entered correctly. I strongly recommend that a code number is used to identify each subject, as in Figure 6.1. This makes it easy to check any suspicious values, to add extra variables at a later date, to check that nobody is in the study twice, and so on.

    001 17 02 89 25 11 33 1 2 170.2  77.1 141  82 129  79
    002 21 02 89 02 02 44 1 1 162.3  80.8 150  85 144  81
    003 28 02 89 14 06 40 2 2 151.9  72.2 142  79 142  76
    004 05 03 89 01 12 28 1 1 178.8  91.4 181 101 155  87
    005 11 03 89 18 05 48 1 1 166.0  81.8 170  90 158  84
    006 12 03 89 24 09 37 2 1 171.4  73.3 139  82 134  78
    007 17 03 89 07 04 36 2 2 155.8  61.5 184 107 177 102
    008 20 03 89 12 02 38 1 2 185.2 100.6 157  93 150  88

等等
etc

图6.1 一种普遍适用的数据布局样式示例,用于输入统计计算机程序。不同项目由一个或多个空格分隔,每列数字对齐。各列依次包含患者编号、入组日期、出生日期、性别、治疗方案、身高、体重、初始血压(收缩压和舒张压)及最终血压。
Figure 6.1 An example of a generally applicable style of data layout for entry into a statistical computer program. Different items are separated by one or more spaces and the figures in each column are aligned. The columns contain, in sequence, patient number, date of entry to study, date of birth, sex, treatment, height, weight, initial blood pressure (systolic and diastolic) and final blood pressure.

自由格式的替代方案是固定格式,数据文件中项目间不需空格分隔。此格式的缺点是必须告知程序具体格式,若不熟悉编程则较复杂。固定格式可用空白表示缺失数据,但这并不理想,因为空白无法区分是故意缺失还是疏漏。此外,大多数程序会将空白解释为零,这可能导致严重错误(见6.3节)。固定格式文件占用的磁盘空间稍小,但实际影响不大。并非所有软件包都支持固定格式,且自由格式更易处理。
The alternative to free format is fixed format, in which items need not be separated by spaces in the data file. The disadvantage of this format is that it is necessary to tell the program the precise format used, and this can be complicated if you are unused to computer programming. With fixed format you can use blanks to indicate missing data, but this is a bad idea as a blank cannot be distinguished from an omission due to oversight. Also, most programs will interpret blanks as zero, which is potentially disastrous (as shown in section 6.3). Fixed format files occupy slightly less space on the disk, but this is of no real practical consequence. Not all packages can accept data in fixed format, and in any case free format is easier to deal with.
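
To make the fixed-format hazard concrete, here is a small Python sketch. The column layout in `LAYOUT` and the function `parse_fixed` are hypothetical; the `blank_as_zero` switch imitates a program that silently reads a blank field as zero.

```python
# Hypothetical fixed-format layout: (variable name, start column, end column),
# using Python's 0-based, end-exclusive slice positions.
LAYOUT = [
    ("patient", 0, 3),
    ("sex", 3, 4),
    ("drug", 4, 5),
    ("systolic", 5, 8),
    ("diastolic", 8, 11),
]


def parse_fixed(line, blank_as_zero=False):
    """Cut one fixed-format line into variables by column position."""
    record = {}
    for name, start, end in LAYOUT:
        text = line[start:end].strip()
        if text == "":
            # A blank field: either flag it as unknown or (dangerously) read it as 0.
            record[name] = 0 if blank_as_zero else None
        else:
            record[name] = int(text)
    return record


line = "00412181   "  # patient 004, sex 1, drug 2, systolic 181, diastolic left blank
print(parse_fixed(line))                      # diastolic is reported as None
print(parse_fixed(line, blank_as_zero=True))  # diastolic becomes 0
```

With `blank_as_zero=True` the unrecorded diastolic pressure enters the analysis as a genuine value of 0, which is the kind of error the text warns about.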

统计程序要求每个受试者每个变量都有值,自由格式的一个优点是即使缺失值也需输入某个数值。不能用空白,因为空白用于分隔相邻项目。有些程序允许用特殊符号(如 ? 或 *)表示缺失数据,否则必须用该变量不可能出现的数值(如 -1 或 99)代替,并且要记得在程序中指定该缺失值代码。若无缺失值代码功能,统计软件包将不可接受。
Statistical programs require a value for every variable for each subject, so a good feature of free format is that you will need to enter some quantity even when a value is missing. A blank cannot be used because blanks are used to separate adjacent items. Some programs have the useful facility of letting you indicate missing data in the file by a special symbol, such as ? or *. Otherwise you must give missing data a numerical value which is impossible for that variable, perhaps -1 or 99, and then remember to give the appropriate instruction to the program to indicate the missing value code. The absence of a missing value code facility would make a statistical package unacceptable.
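
As a rough illustration of how a program might be told about missing value codes when reading free format, consider the Python sketch below. The variable names, the codes in `MISSING_CODES` and the function `read_free_format` are assumptions made for the example.

```python
MISSING_SYMBOLS = {"*", "?"}  # symbols some programs accept for missing data
# Hypothetical numeric missing value codes, one per variable that needs one.
MISSING_CODES = {"sex": 9, "systolic": 999, "diastolic": 999}


def read_free_format(line, names):
    """Split a free-format line on blanks and turn missing codes into None."""
    fields = line.split()
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} items, found {len(fields)}")
    record = {}
    for name, text in zip(names, fields):
        if text in MISSING_SYMBOLS:
            record[name] = None
            continue
        value = float(text)
        # Treat the declared impossible value for this variable as missing.
        record[name] = None if value == MISSING_CODES.get(name) else value
    return record


names = ["patient", "sex", "systolic", "diastolic"]
print(read_free_format("004 1 181 999", names))  # diastolic is None
print(read_free_format("005 * 170 90", names))   # sex is None
```

Because every variable must have an entry, a line with too few items is rejected outright rather than being quietly misread.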

当每个个体的数据太多,无法在屏幕宽度(80字符)内显示时,程序处理方式多样,需查阅手册。
There is a variety of ways in which programs can handle the situation where you have too much data for each individual to fit onto the width of your screen (80 characters). You will need to consult your manual.

6.7.2 表单设计 6.7.2 Form design

图6.2展示了一个可用于收集图6.1中数据的表格。其部分特征已在前文描述过,如受试者的识别代码。每组方框旁的数字表示数据输入计算机时,从行首开始的字符数。缺失的数字表示每条信息之间有空格,说明数据采用自由格式。注意,患者姓名不应录入计算机文件。
Figure 6.2 shows a form that could have been used to collect the data shown in Figure 6.1. Some of its features have been described already, such as the subject's identifying code number. The numbers associated with each group of boxes indicate the number of characters from the start of the line when the data are typed into the computer. The missing numbers mean that there is a blank between each piece of information, indicating that the data will be in free format. Note that the patient's name should not be entered into the computer file.

患者姓名
Patient's name

患者编号
Patient's number

入组日期
Date of entry to study

出生日期
Date of birth

性别(1=男,2=女)
Sex (1=Male, 2=Female)

药物(1=Andreprevin,2=Doryprevin)
Drug (1=Andreprevin, 2=Doryprevin)

身高(厘米)
Height (cm)

体重(千克)
Weight (kg)

初始血压(毫米汞柱)
Initial Blood Pressure (mm Hg)

末次血压(毫米汞柱)
Final Blood Pressure (mm Hg)


图6.2 用于收集比较两种降压药试验数据的表格部分,对应图6.1中的数据。
Figure 6.2 Part of a form to collect data for a trial comparing two antihypertensive drugs, corresponding to the data in Figure 6.1.

设计用于记录数据的表格需要仔细考虑。分类变量和连续变量带来不同的问题。
The way in which forms are designed for recording data needs careful thought. Categorical and continuous variables pose different problems.

(a) 分类数据 (a) Categorical data

应该给每个可能的类别分配一个数字,如以下示例:
A number should be assigned to each possible category, as in the following examples:

糖尿病: 血型:
Diabetes: Blood group:

我强烈建议所有编码都直接写在表格上,而不是放在单独的纸张上。图6.2展示了两个简单的例子。如果必须使用大于9的编码,则需要第二个方框。
I strongly recommend that all the codes are on the form itself rather than on a separate sheet. Two simple examples are shown in Figure 6.2. If it is necessary to use codes higher than nine, a second box will be needed.

当使用固定格式时,建议避免使用0作为编码,因为某些程序无法区分0和未填写的空白框。自由格式输入则不存在此问题,因为每个变量都必须输入一个数值。
It is advisable to avoid zero as a code when fixed format is used, as some programs do not distinguish 0 from a blank corresponding to a box which has not been filled in. This is not a problem for free format input, as some number must be entered for every variable.

有些变量可能有多个非互斥的答案,如既往或同时用药。此时需要为每个感兴趣的答案设置一个是/否的方框。尽可能保持编码一致是理想的,例如所有是/否问题应使用相同的编码。
Some variables have several possible non-mutually exclusive answers, such as prior or concomitant therapy. Here it is necessary to have one yes/no box for each possible answer of interest. It is desirable to have consistent codes where possible. For example, all questions with yes or no answers should use the same codes.

有些程序允许用字母代替数字作为分类数据的编码。例如性别可用M或F表示,药物可用A或D表示。这有一定优势,但意味着数据文件不一定能被所有程序接受。
It is possible with some programs to use letters instead of numbers for categorical data. Thus sex could be entered as M or F, and drug as, say, A or D. This has some advantages, but means that the data file will not be acceptable to all programs.

(b)连续数据 (b) Continuous data

测量值应记录到与测量精度相同的程度—在记录前四舍五入数值没有任何优势。通常也不建议在记录数据时将连续变量分类,例如通过给数值范围分配数字代码。为了统计分析,最好尽可能精确地记录数据。每个数字应占一个方格,且如果相关,应显示小数点的位置,如图6.2中身高和体重所示。小数点不必占用单独的方格;如果此处省略小数点,分析前我们需要将所有身高和体重除以10。表格上标明所用单位很有用,尤其是在常用单位有多种选择时。
Measurements should be recorded to the same accuracy as that to which they are measured - there is no advantage in rounding values before recording them. Nor is it usually a good idea to categorize continuous variables when recording data, for example by allocating numeric codes to ranges of values. For statistical analysis it is desirable to have the data recorded as precisely as possible. One box should be allowed for each digit, and the location of the decimal point should be shown if relevant, as for height and weight in Figure 6.2. The decimal point does not have to have its own box; if it were omitted here we would need to divide all heights and weights by 10 before analysis. It is useful to indicate on the form the units used, especially where there are alternatives in common use.

每个方格只能放一个数字,因此必须为可能记录的最大值预留足够的方格。例如,成人体重以千克计,应在小数点前预留三个方格,因为有人体重超过100千克。即使第一个方格最终未被使用,也无妨。
Only one digit should go in each box, so it is essential to allow enough boxes for the largest value that could be recorded. Thus we ought to allow three boxes before the decimal point for adult body weight in kg because some people weigh more than 100 kg. It will not matter if it turns out that the first box is never used.

填写表格的人应理解在不需要使用所有方格时,使用右侧方格的重要性。因此,舒张压低于100时,必须填写在三个可用方格中的第二和第三个方格内。
Whoever fills in the forms should understand the importance of using the right-hand boxes when not all boxes are needed. Thus diastolic blood pressure below 100 must be written in the second and third of the three available boxes.

(c) 日期 (c) Dates

英国通常的日期顺序是日、月、年,如图6.2所示,但在美国则是月、日、年。表格上注明所需的顺序非常重要。
The usual British ordering of dates is day, month, year, as shown in Figure 6.2, but in the USA it is month, day, year. It is important to indicate on the form which order is required.

事实上,年、月、日的顺序是一个不错的选择,尽管不太常见,但它允许通过将日期视为六位数字来简单排序数据。
In fact the order year, month, day is a good option, apart from its unfamiliarity, as it allows the data to be sorted simply by using the date as a six digit number.
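
The sorting advantage of the year, month, day order can be seen in a short Python sketch. The dates are the first, fourth and third entry dates from Figure 6.1; the helper names are illustrative.

```python
def as_yymmdd(day, month, year):
    """Write a date as a six-digit number in year, month, day order."""
    return year * 10000 + month * 100 + day


def as_ddmmyy(day, month, year):
    """Write a date as a six-digit number in day, month, year order."""
    return day * 10000 + month * 100 + year


# (day, month, two-digit year): 17 Feb 89, 5 Mar 89, 28 Feb 89
entry_dates = [(17, 2, 89), (5, 3, 89), (28, 2, 89)]

print(sorted(as_yymmdd(d, m, y) for d, m, y in entry_dates))
# [890217, 890228, 890305]  -> chronological order
print(sorted(as_ddmmyy(d, m, y) for d, m, y in entry_dates))
# [50389, 170289, 280289]   -> 5 March wrongly sorts before 17 February
```

Note that with two-digit years the numeric ordering is only chronological when all dates fall within the same century.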

(d) 缺失数据 (d) Missing data

如果你的程序接受用于此目的的符号,比如*,那么可以使用,但这意味着你的文件可能无法被其他统计程序读取。否则,应使用特殊的数字代码。最常用的方法是用数字9填满每个格子,例如未知的血压记录为999,未知的性别记录为9。
If your program will accept a symbol for this purpose, such as *, then this can be used, but it will mean that your file may not be readable by any other statistical program. Otherwise a special numeric code should be used. The most common method is to fill each box with nines, so that unknown blood pressure is recorded as 999 and unknown sex as 9.

6.7.3 多重形式 6.7.3 Multiple forms

临床数据通常是在每次见到受试者时收集的,例如在怀孕期间。当患者的随访次数不均等时,研究对象之间的信息量也会有所不同。图6.1中隐含的假设是每个受试者的信息量应相同。虽然可以使用一种称为数据库的软件存储每个受试者不等量的数据,但大多数统计软件只能处理所有受试者均有相同信息的数据集(称为矩形数据)。在开始分析之前,通常需要对这类数据集进行一定的汇总。
Clinical data are often obtained from a subject each time they are seen, for example during pregnancy. When patients are not seen equally often there will be varying amounts of information among the subjects being studied. It is implicit in Figure 6.1 that there should be the same amount of information about each subject. Although unequal amounts of data per subject can be stored using a type of software known as a database, most statistical computer programs can only deal with data sets where the same information is available for all subjects (called rectangular data). Some summarizing of such data sets will be necessary before analysis can begin.

绝不应简单地将每次随访视为独立的数据集—将来自同一患者的多条记录当作多名患者的数据是完全无效的。在研究设计阶段,即使是基于回顾性病例记录的研究,也应考虑数据量的差异。这类数据在组织成适合统计分析的形式时可能极为复杂。如果必须从每个受试者收集多组数据,我建议寻求专家协助。
The simple expedient of treating each visit as a separate data set should never be adopted - it is completely invalid to treat multiple records from one patient as if they were from several patients. The amount of data should be considered when the study is being planned, even for retrospective studies based on examining case notes. This type of data can be extremely difficult to organize appropriately for statistical analysis. I recommend expert assistance if it is necessary to collect multiple sets of data from each subject.

6.7.4 文件中数据的分析 6.7.4 Analysis of data on a file

上述格式的一个结果是,计算机文件中的每一行代表一个个体,不同的变量分布在各列。许多分析中我们希望比较受试者的亚组,因此有必要让程序能够接受所有数值都在同一列中,而子组指示符在另一列中的数据。例如,对于图6.1和6.2中显示的数据,我们希望比较接受不同治疗的患者的最终血压值。一些程序要求比较的数据集必须位于不同的列中,这种情况下它们不适合对以所述方式存储的文件数据进行统计分析。这是评估统计软件时需要考虑的另一个特点。
A consequence of the formats described above is that on the computer file each row represents an individual, with different variables in columns. In many analyses we wish to compare subgroups of subjects, so it is necessary for the program to accept data where all the values are in a single column, and an indicator of the subgroup is in a different column. For example, for the data shown in Figures 6.1 and 6.2 we would wish to compare final blood pressures for patients receiving the different treatments. Some programs expect sets of data to be compared to be in different columns, in which case they would be unsuitable for statistical analysis of data stored in files in the manner described. This is a further feature to be considered when evaluating statistical software.
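
The layout with all values in one column and a subgroup indicator in another can be handled as in the following Python sketch, which uses the drug code and final systolic pressure of the eight patients listed in Figure 6.1. The mean is used purely as an illustrative summary; with only eight patients it says nothing about the drugs.

```python
from collections import defaultdict
from statistics import mean

# One row per patient: (drug code, final systolic blood pressure),
# taken from the treatment and final blood pressure columns of Figure 6.1.
rows = [
    (2, 129), (1, 144), (2, 142), (1, 155),
    (1, 158), (1, 134), (2, 177), (2, 150),
]

# Use the indicator column to split the single column of values into subgroups.
by_drug = defaultdict(list)
for drug, final_systolic in rows:
    by_drug[drug].append(final_systolic)

means = {drug: mean(values) for drug, values in sorted(by_drug.items())}
print(means)  # {1: 147.75, 2: 149.5}
```

The data never need to be rearranged into one column per treatment; the indicator variable does the grouping.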

6.8 绘图 6.8 PLOTTING

使用计算机的一个主要优势是能够绘制数据图。无论是在屏幕上还是纸上绘图,都可以生成两种类型的图:折线图或高分辨率图。第一种方法使用常见的“字母数字”字符集,将每个点尽可能准确地放置在其正确位置,而第二种方法利用计算机的图形功能,提供更精确的绘图。图6.3显示了图11.4中数据的折线图;大多数统计软件包都能生成折线图。重合的点通常通过显示点的数量(最多9个)来表示。指定坐标轴的比例和区分不同组别通常较为困难。大多数统计软件包可以绘制直方图和散点图,有些还能生成箱线图或茎叶图。折线图在数据分析中非常有用,但质量不是最高。直到最近,少数统计软件包能生成高分辨率图,但改进的图形功能正变得越来越普遍。本书中的图形是在个人计算机上(使用STATA软件包)生成,并通过高分辨率激光打印机打印的。
The ability to plot data is a major advantage of using a computer. There are two types of plot that can be produced, whether plotting on the screen or on paper - a line plot or a high-resolution plot. The first method uses the usual 'alphanumeric' character set and places each point as near as possible to its correct location, while the second uses the graphical capability of the computer to give a much more accurate plot. Figure 6.3 shows a line plot of the data shown in Figure 11.4; line plots can be produced by most statistical packages. Coincident points are usually indicated by showing the number of points (up to 9). It is often difficult to specify the scaling of axes and to indicate different groups. Most statistical packages can plot histograms and scatter diagrams, and some can produce box-and-whisker or stem-and-leaf plots. Line plots are very useful when analysing data, but they are not of top quality. Until recently few statistical packages could produce high-resolution plots, but improved graphical facilities are becoming more common. The figures in this book were produced on a personal computer (using the package STATA) and printed on a high-resolution laser printer.

图6.3 显示了图11.4中数据的折线图,涉及哺乳动物母体体重对胎儿体重的对数关系。
Figure 6.3 Line plot of data shown in Figure 11.4 relating log maternal weight to log fetal weight in mammals.

6.9 计算机的其他用途 6.9 OTHER USES OF COMPUTERS

计算机在统计学上的显著主要价值在于数据分析,如前所述,还包括绘制图表。然而,还有一些其他应用也很有用,主要涉及随机数。
The obvious main statistical value of computers is in the analysis of data and, as just indicated, for producing graphs. However, there are some other applications which can be useful, mostly relating to random numbers.

我们在第5章中看到随机分配在研究中的重要性,尤其是在临床试验中分配治疗。表B13可用于此目的,它是由计算机程序通过一种算法生成的,该算法产生的数字序列几乎具有随机数的所有特性。使用表的缺点是每次必须使用同一组数字,尽管可以从表中的任意位置开始。更好的方法是每次使用合适的算法生成新的随机数,许多统计程序都能做到这一点。算法的质量各异,但都足以满足治疗分配的需要。
We saw in Chapter 5 the importance of random allocation in research, especially for allocation of treatments in clinical trials. Table B13, which can be used for this purpose, was generated by a computer program using an algorithm that produces a sequence of numbers which have virtually the same properties as random numbers. The disadvantage of using a table is that you must use the same set of numbers each time, although you can start at an arbitrary place in the table. It is better to generate new random numbers each time using a suitable algorithm, and many statistical programs can do this. Algorithms vary in quality, but all are likely to be good enough for treatment allocation.
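
Generating a fresh allocation sequence by computer might be sketched as follows in Python, using the standard library's pseudo-random number generator. This shows simple (unrestricted) randomization only; the function `allocate`, the treatment labels and the seed value are assumptions for the example.

```python
import random


def allocate(n_patients, treatments=("A", "B"), seed=None):
    """Return a simple random allocation of treatments, one per patient.

    Recording the seed makes it possible to regenerate the same list later.
    """
    rng = random.Random(seed)
    return [rng.choice(treatments) for _ in range(n_patients)]


allocation = allocate(12, seed=1989)
print(allocation)
```

Each run with a different seed gives a new sequence, which avoids reusing the same set of numbers as happens with a printed table. Simple randomization does not guarantee equal group sizes.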

表B13中的数字是0到9之间的数字,因此是均匀分布的随机样本,如第4章所述。有时我们希望从其他分布中获得随机样本,特别是正态分布,以便在已知条件下研究变异性。第4.7图展示了从标准正态分布中抽取的50个随机样本,这就是这种用法的一个例子。多个程序可以生成来自正态分布的随机样本。
The numbers in Table B13 are digits in the range 0 to 9, and are thus a random sample from a Uniform distribution, as described in Chapter 4. Sometimes we wish to obtain a random sample from some other distribution, especially the Normal distribution, in order to study variability under known conditions. An example of this use was seen in Figure 4.7 which showed random samples of size 50 from a standard Normal distribution. Several programs can generate random samples from a Normal distribution.

研究在特定情境下发生的情况是称为模拟法的一种简单方法。其思想是研究在对过程性质做出某些假设下会发生什么,以及改变假设的影响。第8章中使用模拟来说明从具有特定特征的总体中抽取样本的变异性。
Investigating what happens in a defined situation is a simple example of an approach known as simulation. The idea is to study what happens under certain assumptions about the nature of a process, and what the effect is of varying the assumptions. Simulation is used in Chapter 8 to illustrate variability among samples drawn from a population with specified characteristics.
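
A very small simulation of this kind could look like the Python sketch below, which draws repeated samples of size 50 from a standard Normal distribution and records each sample mean. The number of samples and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so that the run can be repeated

sample_means = []
for _ in range(5):
    # One random sample of size 50 from a Normal distribution with mean 0 and SD 1.
    sample = [rng.gauss(0.0, 1.0) for _ in range(50)]
    sample_means.append(mean(sample))

# The sample means differ from one another even though every sample
# comes from exactly the same population.
for m in sample_means:
    print(round(m, 2))
```

Changing the sample size or the population standard deviation and rerunning shows the effect of varying the assumptions.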

6.10 计算机的误用 6.10 MISUSES OF THE COMPUTER

本章前面已经提到使用计算机的一些缺点。还有三个应避免的计算机误用。
Some disadvantages of using computers were given earlier in this chapter. There are three further misuses of computers that should be avoided.

(a) “数据挖掘” (a) 'Data-dredging'

许多研究有明确的目标,但也会收集其他“可能有趣”的信息。在没有明确目标的研究中,很容易进行大量统计分析,希望能发现一些有趣的结果。正如我们将在第8章看到的,当总体中不存在真实关系时,仅凭样本数据就有很大可能偶然发现某种表面关系。因此,我在第5章强调,主要目标及主要分析应事先明确。任何探索性分析最多只能作为生成假设的手段,供后续研究验证。
Many studies have clearly defined objectives, but other information is collected because it 'may be interesting'. It is easy to perform a large number of statistical analyses in the hope that something interesting will turn up, especially in studies without clear objectives. As we will see in Chapter 8, there is a good chance of finding some apparent relationship in a sample purely by chance when there is no real relationship in the population. For this reason I stressed in Chapter 5 that the main objectives, and thus the principal analyses, should be clearly identified in advance. Any exploratory analyses should be considered as useful, if at all, only for generating hypotheses for examination in further studies.
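
The size of this risk can be illustrated with a back-of-envelope calculation, sketched here in Python. It assumes, purely for illustration, that the analyses are independent and that each has a 5% chance of showing an apparent relationship when none exists; neither assumption comes from the text, and real analyses of one data set are rarely independent.

```python
def chance_of_any_spurious_finding(n_analyses, false_positive_rate=0.05):
    """Probability that at least one of n independent analyses looks 'positive' by chance."""
    return 1 - (1 - false_positive_rate) ** n_analyses


for n in (1, 5, 20):
    print(n, round(chance_of_any_spurious_finding(n), 2))
# 1 0.05
# 5 0.23
# 20 0.64
```

Under these assumptions, twenty unplanned analyses give roughly a two in three chance of at least one spurious 'finding'.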

(b) 过度复杂 (b) Over-complexity

由于方法学可用,可能会诱使你对数据进行复杂的统计分析,但这并非良好的统计实践。分析应限制在回答相关问题所需的最低限度。保持分析简洁的重要原因之一是,更容易向其他研究者解释你所做的工作和发现。
It may be tempting to subject your data to a complex statistical analysis because the methodology is available, but this is not good statistical practice. The analysis should be restricted to the minimum necessary to answer the relevant questions. One important reason for keeping analyses simple is that it is much easier to explain to other researchers what you did and what you found.

(c) 虚假的精确度 (c) Spurious precision

计算机通常会给出多位有效数字的结果,但报告前几乎总应进行四舍五入。我曾见过发表的方程,声称能预测出生体重精确到 ,妊娠期长短精确到十分钟,这些结果看起来直接来自计算机输出。第2.8节提供了关于统计分析结果适当数值呈现的一些指导,后续章节也有进一步评论。
Computers usually produce results to many significant figures, but they should nearly always be rounded before being reported. I have seen published examples of equations purporting to predict birthweight to the nearest and length of gestation to the nearest ten minutes, results which appeared to have come straight from computer output. Some guidance on the appropriate numerical presentation of the results of statistical analysis was given in section 2.8, and there are further comments in subsequent chapters.
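
Rounding computer output before reporting it is easy to automate, as in this Python sketch. The helper `round_sig` and the default of three significant figures are assumptions for illustration; the appropriate precision depends on the data, as discussed in section 2.8.

```python
from math import floor, log10


def round_sig(x, figures=3):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    # Position of the leading digit decides how many decimal places to keep.
    return round(x, figures - 1 - floor(log10(abs(x))))


print(round_sig(3256.418273))  # 3260.0
print(round_sig(0.0123456))    # 0.0123
```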

6.11 结语 6.11 CONCLUDING REMARKS

除极小的数据集外,统计分析最好使用计算机,因为它带来了诸多好处。然而,虽然计算机消除了手工计算的繁琐和错误,但存在原始数据未被适当检查的风险。下一章将讨论数据的初步检查。此外,统计分析易被执行而缺乏对所用方法的真正理解。正确使用计算机对统计分析极为有益,但在必要时仍需专家指导。
For all but the smallest data sets it is desirable to use a computer for statistical analysis because of all the benefits indicated. However, while computers remove the drudgery and errors associated with hand calculations, there is a danger that the raw data are never examined properly. The next chapter discusses the preliminary inspection of data. Also, it is easy to perform statistical analyses without a true understanding of the methods used. Properly used, computers are enormously beneficial for statistical analysis, but they do not obviate the need for expert advice when appropriate.

7 准备分析数据 7 Preparing to analyse data

没有任何统计技术能从质量可疑的数据中得出“良好”的结果。
No statistical technique will ever yield 'good' results from data of dubious quality.

Buyse (1984)

7.1 引言 7.1 INTRODUCTION

在分析一组数据之前,尽可能检查数据是否正确非常重要。测量时、最初记录数据时、从原始来源(如病历)转录时,或输入计算机时,都可能出现错误。我们通常无法确定什么是正确的,因此我们只关注确保记录的数值合理。这个过程称为数据检查(或数据清理)。我们不能指望发现所有转录和数据录入错误,但希望能发现主要错误。正如我们将看到的,正是这些大错误会影响统计分析。如果数据是在计算机上分析的,那么检查应在数据输入计算机后进行。对于非常大的调查或临床试验,数据清理可能是一个漫长的过程。
Before analysing a set of data it is important to check as far as possible that the data seem correct. Errors can be made when measurements are taken, when the data are originally recorded, when they are transcribed from the original source (such as from hospital notes), or when being typed into a computer. We cannot usually know what is correct, so we restrict our attention to making sure that the recorded values are plausible. This process is called data checking (or data cleaning). We cannot expect to spot all transcription and data entry errors, but we hope to find the major errors. As we will see, it is the large errors that can influence statistical analyses. If the data are being analysed on a computer, then checking should take place after the data have been entered into the computer. For very large surveys or clinical trials cleaning the data may be a lengthy process.

同时,筛选数据以识别可能在分析过程中引起困难的特征也很重要。本章考虑三个具体方面—缺失数据、异常值以及可能需要的数据转换。检查和筛选的内容相似,实际操作中可以同时进行。
It is also important to screen the data to identify features that may cause difficulties during the analysis. Three specific aspects are considered in this chapter - missing data, outlying values, and the possible need for data transformation. Aspects of checking and screening are similar and in practice they can be carried out at the same time.

本章的思想特别针对变量多或受试者多(或两者兼有)的研究,但一般原则适用于任何研究。开始实质性分析前,仔细检查数据至关重要。
The ideas in this chapter are particularly aimed at studies with many variables or subjects or both, but the general principles apply to any study. It is important to examine the data carefully before proceeding to the substantive analysis.

7.2 数据检查 7.2 DATA CHECKING

记录数据中错误很常见。例如,记录的数值可能因单位混淆而错误,数字在转录时可能被颠倒,或数据在输入计算机时被错误键入。数据检查旨在识别并尽可能纠正数据中的错误。显然,原始数据中的错误通常无法纠正,但如果查阅原始记录,可以纠正后期引入的错误。
Errors in recorded data are common. For example, the recorded values may be wrong because of confusion over the correct units of measurement, digits may be transposed when data are transcribed, or data may be mistyped when being entered onto a computer. Data checking aims to identify and, if possible, rectify errors in the data. Clearly errors in the original data cannot usually be rectified, but errors introduced at a later stage can be put right if the original record is consulted.

如第6.6节所述,一个重要的第一步是检查数据是否已正确输入计算机文件。对于大型文件,最好采用双重录入,即重新输入数据并与第一版进行比较,最好使用专门设计的计算机程序。对于小型数据集,最简单的方法是由一人朗读计算机中的数据,另一人对照原始数据进行核对。
As noted in section 6.6, an important first step is to check that the data have been typed into the computer file correctly. For large files double entry is best, whereby the data are retyped and compared with the first version, preferably using a computer program designed for this purpose. For small data sets the simplest way is for one person to read aloud the data from the computer with another person checking against the original data.

数据检查可能会发现一些观察值虽然合理,但远离数据的主要部分。也可能发现一些预期的观察值缺失。这些问题将在第7.3节和7.4节中讨论。
Checking the data is likely to reveal some observations that, while plausible, are distant from the main body of the data. It is also likely to reveal that a number of intended observations are missing. These problems are discussed in sections 7.3 and 7.4.

7.2.1 分类数据 7.2.1 Categorical data

对于分类变量,检查所有记录的数据值是否合理很简单,因为预先指定的值数量是固定的。例如,如果我们有四个血型编码,如下所示:
For categorical variables it is simple to check that all recorded data values are plausible because there is a fixed number of pre-specified values. For example, if we have four codes for blood group, as follows

那么我们期望数据中只出现值1、2、3或4,除非有受试者信息缺失。如果缺失值按照第6章的建议编码为9,那么任何编码为0、5、6、7或8的血型显然是错误的。
then we expect to find only values 1, 2, 3 or 4 in the data, except for any subjects with missing information. If missing values are coded as 9, as recommended in Chapter 6, then we know that any blood group coded as 0, 5, 6, 7, or 8 is clearly wrong.

计算机分析中得到的0值可能表示血型未填写—大多数计算机程序无法区分空白和零。在此例中,O型可能被编码为0而非3。应尽可能核查错误值(必要时追溯至原始信息来源)。如果发现错误,应将值更改为有效编码之一,即1、2、3或4;若无法确认,则应使用缺失值编码。
Values of 0 obtained from computer analysis may indicate that the blood group was left blank - most computer programs do not distinguish blanks and zeros. In this example it is possible that O might be coded as 0 rather than 3. Erroneous values should be checked as far as is possible (if necessary, back to the original source of the information). If a mistake is found the value should be changed to one of the valid codes, here 1, 2, 3, or 4; if not, the missing value code should be used.
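
A check of this kind is simple to program. The Python sketch below flags any blood group code that is neither one of the four valid codes nor the missing value code 9; the recorded values are invented for the example.

```python
VALID_CODES = {1, 2, 3, 4}  # the four blood group codes
MISSING_CODE = 9            # missing value code recommended in Chapter 6


def invalid_codes(values):
    """Return (subject position, value) for every code that cannot be right."""
    return [
        (position, value)
        for position, value in enumerate(values, start=1)
        if value not in VALID_CODES and value != MISSING_CODE
    ]


blood_group = [1, 3, 0, 2, 9, 4, 7]  # hypothetical recorded codes
print(invalid_codes(blood_group))    # [(3, 0), (7, 7)]
```

The flagged subjects are the ones whose original records need to be consulted.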

7.2.2 连续型数据 7.2.2 Continuous data

对于连续测量值,我们通常无法精确判断哪些值合理,哪些不合理,且这并不重要。然而,应始终能够为变量指定合理的上下限。例如,在妊娠研究中,母亲年龄的合理范围可能设为14至45岁;在成年男性研究中,收缩压的合理范围可能设为70至。接着需要识别超出该范围的值,这一过程称为范围检查。
然而,与分类数据不同,这些超出范围的值不一定是错误的。应对可疑值进行核查,发现错误应予以纠正。剩余的超出预设范围的值,若被认为不可能而非仅仅不太可能,应保留原值或标记为“缺失”。因此,建议为每个变量设定两套界限,分别表示可疑(或不太可能)值和不可能值。定义“不可能”可能非常困难。母亲年龄或收缩压的哪些值是不可能的?又在何处界定“不可能”?
For continuous measurements we cannot usually identify precisely which values are plausible and which are not, and it is not important to do so. It should, however, always be possible to specify lower and upper limits on what is reasonable for the variable concerned. For example, in a study of pregnancy we might put limits of 14 and 45 on maternal age, or in a study of adult males we may use limits of 70 and for systolic blood pressure. We then need to identify values outside the limits, a procedure known as range checking. Unlike the categorical data case, however, these values are not necessarily wrong. Suspicious values should be checked and any errors found should be corrected. Values remaining outside the prespecified range must either be left as they are, or recorded as 'missing' if they are felt to be impossible rather than just unlikely. It may, therefore, be advisable to have two sets of limits for each variable, denoting suspicious (or unlikely) values and impossible values. Defining what is impossible may be extremely difficult. What values of maternal age or systolic blood pressure are impossible? And at what point is 'impossible' reached?
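
Range checking with two sets of limits might be sketched in Python as follows. The suspicious limits of 14 and 45 for maternal age come from the text; the 'impossible' limits of 10 and 60 are purely hypothetical placeholders, since the text stresses how hard such limits are to define.

```python
def range_check(value, suspicious, impossible):
    """Classify a value as 'ok', 'suspicious' or 'impossible'.

    Each set of limits is a (lower, upper) pair; the impossible limits
    should lie outside the suspicious ones.
    """
    low_impossible, high_impossible = impossible
    low_suspicious, high_suspicious = suspicious
    if value < low_impossible or value > high_impossible:
        return "impossible"
    if value < low_suspicious or value > high_suspicious:
        return "suspicious"
    return "ok"


maternal_ages = [27, 13, 46, 2, 31]  # hypothetical recorded ages
for age in maternal_ages:
    print(age, range_check(age, suspicious=(14, 45), impossible=(10, 60)))
```

Values flagged as suspicious are checked against the source and corrected only if an error is found; only those judged impossible would be recoded as missing.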

一个常见的错误原因是小数点位置错误,可能是由于对正确计量单位的混淆或抄录错误。通常,十倍的错误会导致不可能的数值,但如果记录的数值看似合理,小数点错位可能不会被发现。只有在有错误证据时,才应纠正看似合理但不太可能的数值。
A common cause of error is misplacing the decimal point, perhaps because of confusion over the right units of measurement to use or a transcription error. Often an error by a factor of ten will give an impossible value, but if the recorded value is plausible a misplaced decimal point may well go undetected. Plausible but unlikely values should be corrected only if there is evidence of a mistake.

7.2.3 逻辑检查 7.2.3 Logical checks

当一个变量的合理取值依赖于另一个变量的值时,数据检查会更复杂。我们称之为逻辑检查。首先,某些信息通常只在特定情况下采集。例如,在一项肾移植后生存研究中,既往妊娠次数的信息仅对女性相关,因此男性应将该值设为缺失或用不同代码表示“无适用性”。(一些计算机程序允许不同类型的缺失信息。)
Checking the data is more complicated when the values of a variable that are reasonable depend on the value of some other variable. We call these logical checks. Firstly, it is common for some information to be sought only in certain cases. For example, in a study of survival after a kidney transplant, information on number of previous pregnancies is relevant only for women, and so for men should be set to missing or to a different code indicating 'not applicable'. (Some computer programs allow for different types of missing information.)

如果对研究对象有入选限制(例如临床试验的入组标准—见第15章),则应尽可能检查数据,确保所有人确实符合资格。一个常见例子是在抗高血压药物研究中,入组对象的血压有一定范围。许多研究还限制参与者的年龄。
If there were restrictions on who should be in the study (for example, entry criteria in a clinical trial - see Chapter 15), then the data should be checked as far as possible to see that everyone really was eligible. A common example is in studies of anti-hypertensive agents, in which there is a range of blood pressures for which subjects can be entered in the study. Many studies have restrictions on the age of participants.

另一个问题是当两个变量用来构造另一个变量时。新变量的值可能不可能,尽管原始两个变量的值都合理。例如,一个常用的体型指标(粗略的肥胖度测量)是“体质指数”或“Quetelet指数”,定义为体重除以身高的平方。如果此类派生变量特别重要,应在主分析开始前与记录变量一起检查。
A different problem occurs when two variables are used to construct another variable. The value of the new variable may be impossible even though the values of the original variables were both reasonable. For example, a common measure of body size (a crude measure of fatness) is the 'body mass index' or 'Quetelet's index', defined as Weight/Height². If such derived variables are especially important they should be checked along with recorded variables before beginning the main analysis.
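
A derived variable can be checked in the same way as a recorded one. The Python sketch below computes the body mass index in the conventional units of kg/m², taking weight in kg and height in cm as on the form in Figure 6.2. The limits of 15 and 40 used for flagging are hypothetical, and the second subject is an invented combination of a height and a weight that each appear in Figure 6.1.

```python
def body_mass_index(weight_kg, height_cm):
    """Weight divided by height squared, with height converted to metres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# (label, height in cm, weight in kg)
subjects = [
    ("patient 004 of Figure 6.1", 178.8, 91.4),
    ("invented combination", 151.9, 100.6),
]

for label, height, weight in subjects:
    bmi = body_mass_index(weight, height)
    flag = "check" if bmi < 15 or bmi > 40 else "ok"  # hypothetical limits
    print(label, round(bmi, 1), flag)
```

In the invented case both the height and the weight would pass their own range checks, yet the derived index of about 43.6 stands out and would prompt a look at the original record.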

更普遍地说,可能存在某些受试者的两个变量值组合极不可能,尽管各自都在可接受范围内。如果我们有两个密切相关的变量,如收缩压和舒张压,我们不期望一个人在收缩压分布的第5百分位,而在舒张压分布中却处于第95百分位。在大型研究中,逐对变量检查不现实,但对于重要变量,如抗高血压药物试验中的血压,应重点关注,最简单的方法是检查散点图。
More generally, there may be subjects who have a combination of values of two variables that is very unlikely even though each is within acceptable limits. If we have two closely related variables, such as systolic and diastolic blood pressure, we do not expect a subject at the 5th centile of the distribution of systolic pressure to be at the 95th centile for diastolic pressure. In a large study it is impracticable to consider all pairs of variables in this way, but those of major importance, such as blood pressure in anti-hypertensive drug trials, should be studied closely, most simply by examining scatter diagrams.

本节最后一种情况是同一变量在每个受试者身上测量多次。绘制每个人的测量序列图很有价值,以确保其变化合理。有时我们期望每次测量值都大于上一次,比如儿童的年度身高测量,这种情况易于验证。不幸的是,使用统计软件制作此类图表可能较难,因为少数程序能处理每个受试者的序列数据。
Lastly in this section, there is the case where the same variable is measured several times on each subject. It is valuable to plot each person's sequence of recorded values to ensure that they behave reasonably. Sometimes we will expect each measurement to be larger than the previous one, such as annual height measurements of children, and this is easily verified. Unfortunately it may be difficult to produce such plots using statistical software, as few programs can cope with serial data on each subject.
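Where plotting each subject's series is awkward, a simple programmatic check can at least flag series that do not behave as expected. The following is a minimal sketch, assuming that each measurement should be no smaller than the previous one (as with annual heights of children); the subject identifiers and heights are hypothetical.

```python
def decreasing_steps(values):
    """Return the positions at which a value is smaller than the one before it."""
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]

# Hypothetical annual heights (cm); the identifiers are made up for illustration.
heights = {
    "child_01": [101.2, 107.5, 113.0, 118.4],
    "child_02": [99.8, 106.1, 103.9, 116.0],  # third value is lower than the second
}

for child_id, series in heights.items():
    problems = decreasing_steps(series)
    if problems:
        print(child_id, "has a decrease at position(s)", problems)
```

A flagged series is only a prompt to look at the original records; the check says nothing about which of the two values is wrong.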

7.2.4 日期 7.2.4 Dates

记录日期在计算两个事件之间时间时非常重要。例如,我们可以根据事件日期和受试者出生日期计算其在某事件(如手术或死亡)时的年龄。其他常见计算包括事件与患者死亡之间的时间(生存时间)或首次症状与疾病诊断之间的时间。如第6章所建议,最好记录所有相关日期,因为心算时间间隔极不可靠。然而,记录日期也带来问题,因为日期特别容易出现抄录错误。
Recorded dates are important when they are used to calculate the time between two events. For example, we can calculate a subject's age at some event, such as surgery or death, from the date of the event and the subject's date of birth. Other common calculations are the time between an event and the patient's death (their survival time) or the time between the first symptom and the diagnosis of the disease. As recommended in Chapter 6, it is preferable to record all the relevant dates, as mental calculation of time intervals is extremely unreliable. However, recording dates also causes problems as they are especially prone to transcription errors.

日期应按以下方式检查:
Dates should be checked as follows:

【1】检查所有日期是否在合理的时间范围内。出生日期可能与研究纳入的年龄范围相关。注意,包含老年人的研究可能包括1900年以前的出生日期。其他事件的日期,如手术或死亡,通常会在研究的时间范围内。

1. Check that all dates are within a reasonable time span. Dates of birth may relate to the age range for inclusion in a study. Note that studies including elderly people may include dates of birth before 1900. Dates of other events, such as surgery or death, will probably lie within the time span of the study.

【2】检查所有日期是否有效。月份中的日期应在1到31之间,依此类推,但诸如2月30日之类的日期是不可能存在的。一些计算机程序具有检查日期有效性的功能。
2. Check that all dates are valid. The day of month should lie in the range 1 to 31, and so on, but dates such as 30 February are impossible. Some computer programs have routines for checking the validity of dates.

【3】检查日期的正确顺序。不同事件的日期通常应按一定顺序排列,例如出生、手术和死亡的日期。
3. Check that dates are correctly sequenced. Often dates of different events should fall in a certain sequence, such as dates of birth, surgery, and death.

【4】检查推导的年龄和时间间隔。在完成检查(1)和(2)后,应利用日期计算感兴趣的年龄和时间间隔,如手术时的年龄或手术与死亡之间的时间。然后应对这些结果进行范围检查,如前所述。
4. Check derived ages and time intervals. After checks (1) and (2) the dates should be used to calculate ages and time intervals of interest, such as age at surgery or time between surgery and death. These should then be range checked as described earlier.
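The four checks above can be sketched with Python's standard `datetime` module. This is an illustration only: the function names, the study period and the upper age limit of 110 years are assumptions, not part of the text.

```python
from datetime import date

def parse_date(day, month, year):
    """Check (2): return a date, or None if the day/month/year is not a valid date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None

def check_dates(birth, surgery, death, study_start, study_end, max_age=110):
    """Apply checks (1), (3) and (4) to already-valid dates; death may be None."""
    problems = []
    # (1) event dates should lie within the time span of the study
    for label, d in (("surgery", surgery), ("death", death)):
        if d is not None and not (study_start <= d <= study_end):
            problems.append(label + " outside study period")
    # (3) dates should be correctly sequenced: birth, surgery, death
    if surgery < birth:
        problems.append("surgery before birth")
    if death is not None and death < surgery:
        problems.append("death before surgery")
    # (4) range check a derived age (max_age is an assumed limit)
    age_at_surgery = (surgery - birth).days / 365.25
    if not (0 <= age_at_surgery <= max_age):
        problems.append("implausible age at surgery")
    return problems

print(parse_date(30, 2, 1989))  # 30 February is not a valid date
print(check_dates(date(1920, 5, 1), date(1989, 8, 15), date(1989, 8, 13),
                  date(1985, 1, 1), date(1990, 12, 31)))
```

Returning a list of problems rather than stopping at the first one makes it easier to send all queries about a subject back to the data source at once.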

7.3 异常值 7.3 OUTLIERS

对连续变量的数据进行检查时,可能会发现一些与其他数据不符的异常值。通常,某些变量可能存在一两个异常值,而大多数变量则不会有异常值。
Checking the data for continuous variables may reveal some outlying values that are incompatible with the rest of the data. Typically there may be one or two outliers for a few variables, although for most variables there will not be any.

如前所述,应仔细检查可疑值。如果没有错误证据且该值合理,则不应更改。该规则的例外情况是值正确但调查发现该个体有特殊情况,如合并疾病。在这种情况下,排除该观察值可能是合理的。相反,仅仅因为值最大或最小就删除它们是非常危险的。也没有理由采用自动化程序,如删除所有偏离均值三倍标准差以上的值。统计技术可以用来检测可疑值,但不应决定如何处理这些值。
As already discussed, suspicious values should be carefully checked. If there is no evidence of a mistake, and the value is plausible, then it should not be altered. An exception to this rule is where the value is correct but investigation reveals that there is something special about that individual, such as a concurrent illness. Here it may be reasonable to exclude the observation. In contrast, it is especially dangerous to remove values simply because they are largest or smallest. Also, there is no justification behind automated procedures such as removing all values more than three standard deviations away from the mean. Statistical techniques can be used to detect suspicious values, but should not be used to determine what happens to them.

异常值尤其重要,因为它们可能对统计分析结果产生显著影响。由于定义上它们是极端值,包含或排除它们会对分析结果产生明显影响。举一个简单例子,表7.1显示了20名霍奇金病缓解患者血液样本中每立方毫米细胞的数量。值的均值为823.2,标准差为566.4。如果认为最高值2415是异常值并剔除它,剩余19个值的均值为739.4,标准差为436.4—这两个指标在剔除最大值后均下降。剔除单个观察值的影响,如此例所示,可能非常显著,这就是为什么应在完整分析开始前决定哪些数据将被分析。
Outliers are particularly important because they can have a considerable influence on the results of a statistical analysis. Because by definition they are extreme values, their inclusion or exclusion can have a marked effect on the results of an analysis. To take a simple example, Table 7.1 shows numbers of cells per mm³ in blood samples from 20 patients in remission from Hodgkin's disease. The mean of the values is 823.2 and the standard deviation is 566.4. If we consider that the highest value of 2415 is an outlier and discard it, the mean of the remaining 19 values is 739.4 and the standard deviation is 436.4 - both must fall when the largest value is omitted. The effect of excluding a single observation can, as here, be quite marked, which is why decisions about which data are to be analysed should be made before the full analysis starts.
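The effect of dropping the largest value can be checked directly from the Hodgkin's disease column of Table 7.1. The sketch below uses the sample standard deviation (divisor n - 1), which is assumed to be the definition used in the text.

```python
from statistics import mean, stdev

# Hodgkin's disease column of Table 7.1
hodgkins = [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
            795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415]

def summarise(values):
    """Return (mean, sample standard deviation)."""
    return mean(values), stdev(values)

all_mean, all_sd = summarise(hodgkins)
# Drop only the single largest value, as in the example in the text.
trimmed = sorted(hodgkins)[:-1]
trim_mean, trim_sd = summarise(trimmed)

print(f"all 20 values:   mean {all_mean:.1f}, SD {all_sd:.1f}")
print(f"largest removed: mean {trim_mean:.1f}, SD {trim_sd:.1f}")
```

Running both summaries side by side is the same "with and without" strategy recommended later in this section for any analysis involving a suspicious value.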

数据的直方图显示分布偏斜(图7.1a),而细胞计数对数的分布则对称(图7.1b)。此外,在对数尺度上,明显的异常值看起来非常合理。变换方法将在7.6节讨论。
A histogram of the data shows that the distribution is skewed (Figure 7.1a), whereas that for the logarithm of the cell counts is symmetric (Figure 7.1b). Further, the apparent outlier is seen in the log scale to be very reasonable. Transformations are considered in section 7.6.

异常值在回归分析中可能具有较大影响,这种技术在第11章中介绍,用于寻找描述两个连续变量关系的最佳直线。
Outliers can be influential in regression analysis, a technique described in Chapter 11 for finding the best straight line describing the relation between two continuous variables.

表7.1 20例霍奇金病缓解期患者和20例弥漫性恶性肿瘤(非霍奇金)缓解期患者血液样本中每立方毫米细胞数目(Shapiro 等,1986)。
Table 7.1 Numbers of cells per mm³ in blood samples from 20 patients in remission from Hodgkin's disease and 20 patients in remission from disseminated malignancies (non-Hodgkin's) (Shapiro et al., 1986)

| | 霍奇金病 Hodgkin's | 非霍奇金病 Non-Hodgkin's |
|---|---|---|
| | 171 | 116 |
| | 257 | 151 |
| | 288 | 192 |
| | 295 | 208 |
| | 396 | 315 |
| | 397 | 375 |
| | 431 | 375 |
| | 435 | 377 |
| | 554 | 410 |
| | 568 | 426 |
| | 795 | 440 |
| | 902 | 503 |
| | 958 | 675 |
| | 1004 | 688 |
| | 1104 | 700 |
| | 1212 | 736 |
| | 1283 | 752 |
| | 1378 | 771 |
| | 1621 | 979 |
| | 2415 | 1252 |
| 均值 Mean | 823.2 | 522.1 |
| 标准差 SD | 566.4 | 293.0 |

图7.2显示了12例慢性肾功能衰竭患者血液透析后血浆蛋白水平的变化,其中最年轻的患者可能是异常值。图中同时显示了包括所有数据和排除该患者后的拟合回归线。它们说明回归线会被异常值“拉拽”,无论其余数据的分布如何,尤其是在样本量较小时。单个异常点对视觉印象的影响很大。如果遮盖该可疑值,其他数据中明显无相关关系。第11章建议回归分析应始终配合散点图使用。
Figure 7.2 shows the change in plasma protein levels after haemodialysis in 12 patients with chronic renal failure, in which the youngest patient is a possible outlier. Also shown are the fitted regression lines for all the data and with that patient excluded. They illustrate that the regression line gets 'pulled' towards outlying values, regardless of the distribution of the rest of the data, especially in small samples. A single outlying point can have a considerable effect on the visual impression. If we cover the suspicious value it is clear that there is no apparent relation in the rest of the data. In Chapter 11 I suggest that a scatter diagram should always accompany regression analyses.

异常值会影响多种统计分析,常通过增加观测值的方差而掩盖感兴趣的效应。识别异常值是数据检查的重要附带收益。
Outliers can affect many types of statistical analysis, often by inflating the variance of a set of observations and so obscuring the effect of interest. Awareness of any outliers is a highly beneficial spin-off from checking the data.


图7.1 表7.1中有无霍奇金病患者细胞计数的直方图:(a)原始数据;(b)对数转换后数据。
Figure 7.1 Histograms of cell counts in patients with and without Hodgkin's disease shown in Table 7.1: (a) raw data; (b) after log transformation.


图7.2展示了12例慢性肾功能衰竭患者血液透析后血浆蛋白(单位:)变化与年龄的关系,图中给出了所有数据的回归线(—)和排除最年轻患者后的回归线(- - - - - -)。数据来源:Toulon 等(1987)。
Figure 7.2 Data showing the relation between change in plasma protein after haemodialysis and age in 12 patients with chronic renal failure, showing regression lines for all data (—) and excluding the youngest patient (- - - - - -). Data from Toulon et al. (1987).

分析数据时,一个有用的策略是同时进行包含和排除可疑值的分析,如图7.2所示。如果结果差异不大,则异常值影响较小;反之,则应考虑采用替代分析方法。第8章介绍的秩次方法可能是一个合适的选择。此类问题建议寻求统计专家的帮助。
A useful strategy to adopt when analysing data is to carry out the analysis both including and excluding the suspicious value(s), as in Figure 7.2. If there is little difference in the results obtained then the outlier(s) had minimal effect, but if excluding them does have an effect it may be better to find an alternative method of analysis. Rank methods, introduced in Chapter 8, may be a good approach here. This is an area where expert statistical advice is valuable.

7.4 缺失数据 7.4 MISSING DATA

数据检查的另一个副产品是发现缺失观测值。如第6章所述,常用的做法是根据变量性质使用9、99、999或99.9等代码表示缺失,虽然少数统计软件允许用*或其他符号标示缺失值。如果用数值表示,必须在分析前告知统计软件该值为缺失,否则容易忽视一两个被编码为999的缺失值,导致分析结果严重偏差,详见第6.3节。
Another by-product of checking your data is that any missing observations will be identified. As noted in Chapter 6, the most common device is to use codes such as 9, 99, 999, or 99.9, according to the nature of the variable, although some computer programs (unfortunately few) allow * or some other symbol to indicate a missing observation. If a numeric value is used it is essential to identify the value as a missing value to the statistical software before analysing the data. It is very easy to forget that one or two values are missing, perhaps coded as 999, when carrying out an analysis. The effect on the analysis can be severe, as was illustrated in section 6.3.

使用 * 这类符号的优点在于不会有后续分析将缺失值代码误当作真实观测值的风险。
The advantage of using a symbol such as * is that there is no danger that subsequent analysis will treat the missing value code as a real observation.

对于分类变量,缺失值只是一个额外的类别,因此这些个体可以包含在任何交叉列联表中。然而,在进行统计分析时,仍然重要的是计算机程序能识别该代码(如9)为缺失值。对于连续变量,识别缺失数据尤为关键。
For categorical variables missing is just an additional category and so these individuals can be included in any cross-tabulations. However, it is still important that the code (say 9) is identified as missing in a computer program when performing a statistical analysis. For continuous variables it is essential that missing data are identified.

创建新的“衍生”变量时,必须记住缺失值代码的可能性。例如,如果我们用身高和体重来计算体质指数(BMI)(见7.2.3节),且其中一个或两个变量缺失,如果未将代码识别为缺失,结果可能非常误导:
It is important to remember the possibility of missing value codes when creating a new 'derived' variable. For example, if we use height and weight to derive the body mass index (BMI) (described in section 7.2.3), and either or both variables are missing we can get very misleading answers if we have not identified the codes as missing:

| 身高(米) Height (m) | 体重(公斤) Weight (kg) | BMI(体重/身高²) BMI (Wt/Ht²) |
|---|---|---|
| 1.62 | 68.2 | 26.0 |
| 1.62 | 999.9 | 381.0 |
| 9.99 | 68.2 | 0.7 |
| 9.99 | 999.9 | 10.0 |

在这种情况下,如果任一变量缺失,衍生值将是不可能的,但情况并非总是如此。应在构建衍生变量之前识别缺失值代码。优秀的计算机程序会在任何组成部分缺失时,将衍生变量的值设为缺失。
In this case the derived values if either variable is missing are impossible, but this will not always be the case. Missing value codes should be identified before derived variables are constructed. Good computer programs will set the value of a derived variable to missing if any of its components is missing.
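A minimal sketch of this behaviour follows, assuming the missing value codes 9.99 for height and 999.9 for weight used in the table above. The function and constant names are hypothetical; `None` stands in for a program's own missing value marker.

```python
HEIGHT_MISSING = 9.99    # missing value code for height (m), as in the table
WEIGHT_MISSING = 999.9   # missing value code for weight (kg), as in the table

def recode_missing(value, missing_code):
    """Return None when the value is the missing value code, else the value."""
    return None if value == missing_code else value

def bmi(height_m, weight_kg):
    """Weight/Height², or None if either component is missing."""
    height = recode_missing(height_m, HEIGHT_MISSING)
    weight = recode_missing(weight_kg, WEIGHT_MISSING)
    if height is None or weight is None:
        return None
    return weight / height ** 2

rows = [(1.62, 68.2), (1.62, 999.9), (9.99, 68.2), (9.99, 999.9)]
for height_m, weight_kg in rows:
    naive = weight_kg / height_m ** 2          # codes treated as real values
    print(height_m, weight_kg, round(naive, 1), bmi(height_m, weight_kg))
```

Recoding before deriving is the important step: once a misleading BMI such as 10.0 has been computed, nothing in the value itself marks it as the product of two missing value codes.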

日期有时只部分记录。如果缺少日期,可以将其设为15(平均月份的中间),缺少月份则可设为6或7(年中),以减少可能的误差。如果这种替代对所研究的时间跨度影响极小,则是合理的。然而,应注意这种替代不会导致两个日期顺序的颠倒。例如,手术日期为08-89,缺少日期,而死亡日期为13-08-89,如果将手术日期的日期设为15,则患者的生存时间将变为负2天。
Dates are sometimes only partially recorded. If the day is missing it can be set to 15 (halfway through an average month), and a missing month can be set to 6 or 7 (halfway through the year) to minimize the possible error. Substitutions like these are reasonable if the effect is very small compared with the time span being investigated. However, care should be taken that this substitution does not result in a reversal of the sequence of two dates. For example, if date of surgery is given as 08-89, with the day missing, and date of death is 13-08-89, then setting the day of surgery to 15 will make the patient's survival time -2 days.
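The surgery and death example can be reproduced as follows. This is a sketch only: the helper name `fill_day` is hypothetical, and the two-digit year 89 is assumed to mean 1989.

```python
from datetime import date

def fill_day(year, month, day=None, default_day=15):
    """Build a date, substituting the 15th when the day was not recorded."""
    return date(year, month, default_day if day is None else day)

surgery = fill_day(1989, 8)        # recorded as 08-89, day missing
death = fill_day(1989, 8, 13)      # recorded as 13-08-89

survival_days = (death - surgery).days
if survival_days < 0:
    # The substitution has reversed the order of the two events.
    print("substituted day gives a negative survival time:", survival_days)
```

A check of this kind belongs immediately after any substitution, so that reversed sequences are caught before survival times are analysed.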

7.4.1 为什么数据会缺失? 7.4.1 Why are data missing?

值得思考数据缺失的原因;特别是我们应了解是否与研究性质相关。与不可能的值一样,可能需要核实原始信息来源,确认缺失观测确实缺失。
It is worth thinking about why the data are missing; in particular we ought to know if there is a reason related to the nature of the study. As with impossible values, it may be possible to check with the original source of the information that missing observations are really missing.

缺失值往往是随机的,与研究无关。例如,某些患者可能未被询问特定问题,或血样丢失或损坏。大多数大型研究都会因类似原因存在缺失数据。然而,缺失信息可能具有信息性。在多次收集患者信息的研究中,后期缺失可能是因为患者因副作用退出研究,甚至死亡。另一种可能是患者因感兴趣变量异常反应而退出研究。例如,高血压研究中,若患者血压超过预设水平,常被撤出研究,这必然影响血压变化分析。关于此类数据的进一步讨论见14.6节。
Frequently values are missing essentially at random, for reasons not related to the study. For example, some patients may not have been asked a particular question, or a blood sample may have been lost or destroyed. Most large studies will have some missing data for reasons like these. The lack of information may, however, be informative. In a study in which information about a patient is collected on several occasions, lack of information for the later times may be because the patient was withdrawn from the study due to side-effects, or even because they died. Another possibility is that they may have been withdrawn from the study because the variable of interest responded inappropriately. For example, it is common in studies in hypertension to withdraw patients if their blood pressure rises above a pre-selected level, which must compromise an analysis of change in blood pressure. There is further discussion of this type of data in section 14.6.

对于“是”或“否”编码的信息,如特定症状的存在,可能会想将缺失值替换为“否”,理由是若症状存在信息应已被记录。但这一假设通常不成立,不应轻易做出。此问题在回顾性研究中尤为突出,例如从患者住院记录中获取数据时。
For information that is coded as 'yes' or 'no', such as the presence of a particular symptom, it may be tempting to consider replacing missing values by 'no', on the grounds that the information would have been recorded if the symptom had been present. This assumption is usually unwarranted, and should not be made lightly. This problem is most likely in retrospective studies, for example when data are obtained from patients' hospital notes.

7.5 数据筛查 7.5 DATA SCREENING

到目前为止,本章我已经讨论了尽可能检查数据正确性的各个方面。初步数据检查的另一个重要方面是评估数据是否适合预期的分析类型,这一过程有时称为数据筛查。如前所述,一个或多个异常值的存在可能显著影响甚至使分析无效。数据筛查主要关注连续数据的分布,异常值只是本节考虑的一个方面。
So far in this chapter I have considered various aspects of checking, as far as possible, that the data are correct. The other important aspect of preliminary data examination is to see how suitable the data are for the type of analysis that is intended, a process sometimes called data screening. As already indicated, the presence of one or more outliers can markedly affect, and perhaps invalidate, an analysis. Data screening is concerned largely with the distribution of continuous data, outliers being just one of the aspects considered in this section.

7.5.1 观察值的分布 7.5.1 The distribution of observations

正如后续章节将展示的,许多连续数据的统计分析方法都基于数据来自正态分布总体的假设。基于秩的替代方法通常可用,且不依赖该假设,但它们存在一定的缺点。在基于正态性假设进行分析之前,了解数据的分布非常重要。不符合正态分布的数据通常可以通过变换使其接近正态分布,具体方法见第7.6节。
As subsequent chapters will show, many types of statistical analysis of continuous data are based on the assumption that the data are a sample from a population with a Normal distribution. Alternative methods based on ranks are usually available that do not make that assumption, but they have certain disadvantages. It is important to know the distribution of the data before embarking on an analysis based on the assumption of Normality. Data that are not compatible with a Normal distribution can often be transformed to make them acceptably near to Normal, as described in section 7.6.

对于每个连续变量,应计算其均值和标准差(SD)。如果可能,应绘制直方图以观察分布形态。如果无法绘制直方图,则可检查分布的分位数(例如,第10、第50和第90百分位数)以判断分布是否对称。
For each continuous variable the mean and standard deviation (SD) should be calculated. If possible a histogram should be produced to see the shape of the distribution. If this is not possible then quantiles of the distribution (for example, the 10th, 50th and 90th centiles) can be examined to see if the distribution appears symmetric.
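When no histogram is available, the centile comparison can be sketched as below for the Hodgkin's disease counts in Table 7.1. The interpolation rule is the default of Python's `statistics.quantiles`; other programs may use slightly different definitions of a centile.

```python
from statistics import quantiles

# Hodgkin's disease column of Table 7.1
hodgkins = [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
            795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415]

# n=10 returns the nine cut points that divide the data into ten groups,
# i.e. the 10th, 20th, ..., 90th centiles.
deciles = quantiles(hodgkins, n=10)
p10, p50, p90 = deciles[0], deciles[4], deciles[8]

lower_gap = p50 - p10   # distance from the 10th centile up to the median
upper_gap = p90 - p50   # distance from the median up to the 90th centile
print(f"10th {p10:.1f}, 50th {p50:.1f}, 90th {p90:.1f}")
print(f"lower gap {lower_gap:.1f}, upper gap {upper_gap:.1f}")
```

For a symmetric distribution the two gaps should be similar; an upper gap much larger than the lower one suggests skewness to the right.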

尤其对于小样本,判断数据的正态性可能较为困难。如图4.7所示,即使是来自正态分布的50个样本也可能看起来不符合正态分布。下面介绍的称为正态概率图的图形技术能更好地判断正态性。
For small samples especially it may be difficult to judge the degree of Normality of a set of data. As Figure 4.7 showed, even samples of size 50 from a Normal distribution may look non-Normal. The graphical technique called a Normal plot, described below, gives a much better idea of Normality.

检查多个变量的一个好方法是绘制所有变量两两之间的散点图矩阵。图12.2中给出了一个示例。
A good way of checking many variables visually is to produce a 'matrix' of scatter plots of all pairs of variables. An example is given in Figure 12.2.

7.5.2 正态概率图 7.5.2 The Normal plot

正态概率图基于两个思想。首先,累积频率分布比频率分布更能反映数据的形态。它受图4.7中小波动的影响较小。正态分布数据的累积频率分布呈S形,如图4.6所示。然而,仅凭累积频率分布难以判断正态性,这时第二个思想派上用场。因为所有正态分布的形态完全相同(图4.4),我们可以拉伸纵轴,使累积分布函数在数据为正态时成为一条直线。样本数据偏离正态性即表现为偏离直线。
The Normal plot is based on two ideas. First, the cumulative frequency distribution gives a better idea of the shape of the data than does the frequency distribution. It is much less affected by the small fluctuations that were seen in Figure 4.7. The cumulative frequency distribution for data that are Normally distributed has an S shape, as shown in Figure 4.6. It is, however, difficult to judge Normality from the cumulative frequency distribution, which is where the second idea comes in. Because all Normal distributions are precisely the same shape (Figure 4.4) we can stretch the vertical scale to make the cumulative distribution function a straight line if the data are Normal. Departures of the sample data from Normality are thus easily seen as departures from a straight line.

假设我们有一个变量,其总体值服从均值为34.46、标准差为5.84的正态分布。图7.3显示了(a)频数分布,(b)累计频数分布,以及(c)正态图。正态图的横轴表示观测值的数值,纵轴表示相对于均值的标准差数的相对频率。正态图纵轴标注的值对应正态分布的特定累计百分比(见第4.5.1节)。绘图坐标的计算方法将在下文说明。图7.3展示了理论情况,图7.4则展示了从同一总体随机抽取的216个样本的相同过程。顶部面板显示数据的直方图,表现出一些不规则性。第二个面板显示累计频数分布,最后一个是正态图。数据在正态图中接近一条直线。
Suppose we have a variable whose values in the population have a Normal distribution with a mean of 34.46 and a standard deviation of 5.84. Figure 7.3 shows (a) the frequency distribution, (b) the cumulative frequency distribution, and (c) the Normal plot. The horizontal axis of the Normal plot shows the numerical value of the observation, and the vertical axis gives the relative frequency in terms of the number of standard deviations from the mean. The values labelled on the vertical axis of the Normal plot correspond to particular cumulative percentages of the Normal distribution (see section 4.5.1). The calculation of the plotting coordinates is explained below. Figure 7.3 shows what happens in theory, and Figure 7.4 shows the same process for a sample of size 216 chosen at random from the same population. The top panel shows a histogram of the data, which exhibits some irregularities. The second shows the cumulative frequency distribution and the last the Normal plot. The data are close to a straight line in the Normal plot.

既然我们知道了当数据确实来自正态分布时应有的图像,我们便有了判断真实数据的依据。图7.5给出了先前讨论的216例原发性胆汁性肝硬化患者血清白蛋白值的正态图。这些数据的均值为34.46,标准差为5.84。因此,图7.3、7.4和7.5是可以直接比较的。当我们为某组数据绘制正态图时,实际上就是在做这样的比较。图7.5(c)中的正态图非常接近直线,表明这些患者血清白蛋白值的分布接近正态分布,与图4.5一致。下面我将考虑如何量化这种接近程度。
Now that we know what sort of picture to expect when the data really do come from a Normal distribution, we have some basis for judging some real data. Figure 7.5 gives a Normal plot for the serum albumin values from the study of 216 patients with primary biliary cirrhosis previously discussed. These data had a mean of 34.46 and the standard deviation was 5.84, the same values as were used for the theoretical distribution in Figure 7.3. Figures 7.3, 7.4 and 7.5 are thus directly comparable. When we produce a Normal plot for some data this is the comparison that is implicitly being made. The Normal plot in Figure 7.5(c) is very near to a straight line, indicating that the distribution of serum albumin values in these patients is near to a Normal distribution, in agreement with Figure 4.5. I shall consider below how we can quantify the nearness.

相比之下,同一患者群体的血清胆红素值分布在图4.8中显示高度偏斜,远非正态分布。图7.6(a)中数据的明显弯曲的正态图证实了这一点。然而,如第4章所述,经过对数转换后,数据近似正态分布,如图7.6(b)的正态图所示。为什么我们可能希望通过变换数据来获得近似正态分布的原因将在第7.6节讨论。
By contrast, the distribution of serum bilirubin values in the same patients was shown in Figure 4.8 to be highly skewed and not near to a Normal distribution. The markedly curved Normal plot of the data in Figure 7.6(a) confirms this finding. However, as described in Chapter 4, after log transformation the data have a nearly Normal distribution, as shown by the Normal plot in Figure 7.6(b). The reasons why we might wish to transform a set of data to get an approximately Normal distribution are discussed in section 7.6.

虽然正态图是判断一组数据是否服从正态分布的非常有用的图形工具,但它仅提供主观评估。由于抽样变异,我们知道来自正态分布的样本不会完全正态(见图4.7),尤其当样本量较小时。若数据需接近正态,量化偏离正态的程度是一种有用的方法。
While the Normal plot is a very useful graphical device for judging the Normality of a set of data, it only allows for a subjective assessment. Because of sampling variation we know that samples from Normal distributions will not be exactly Normal (see Figure 4.7) especially if the sample is small. Where it is important for the data to be close to Normal it is useful to have a method for quantifying the deviations from Normality.

7.5.3 评估偏离正态分布的程度 7.5.3 Evaluating departures from a Normal distribution

测量非正态性的一种方法是计算所谓的“高阶矩”。前两个矩已描述过,即均值和方差。然而,这些值不能反映分布的形状。我们可以通过基于 $\sum (x_i - \bar{x})^3$ 和 $\sum (x_i - \bar{x})^4$ 的量来测量形状,这显然是方差公式的扩展。由此我们可以导出称为偏度的量,衡量分布的非对称性,以及峰度,衡量分布的平坦或尖峰程度。然后可以将这些值与正态分布的理论值进行比较。然而,我不推荐这种方法,因为更理想的是用单一指标来评估正态性,而不是两个。
One way of measuring non-Normality is to calculate what are called 'higher moments' of the distribution of data. The first two moments have already been described - they are the mean and variance. However, these values give no information about the shape of the distribution. We can measure shape by means of quantities based on $\sum (x_i - \bar{x})^3$ and $\sum (x_i - \bar{x})^4$, which are obvious extensions to the formula for the variance. From these we can derive quantities called skewness, which is a measure of asymmetry, and kurtosis, which is a measure of flatness or peakedness. These values can then be compared with the theoretical values for a Normal distribution. I do not recommend this approach, however, as it is preferable to have a single assessment of Normality rather than two.
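For readers who want to see the calculation, the sketch below derives skewness and kurtosis from sums of third and fourth powers of deviations from the mean. The standardization used (dividing by powers of the second moment, with divisor n throughout) is one common convention and is an assumption here; statistical programs differ in the exact definitions. Under this convention a Normal distribution has skewness 0 and kurtosis 3.

```python
from statistics import mean

def central_moment(values, power):
    """Average of (x - mean) raised to the given power (divisor n)."""
    centre = mean(values)
    return sum((x - centre) ** power for x in values) / len(values)

def skewness(values):
    """Third moment scaled by the second moment to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth moment scaled by the square of the second moment."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hodgkin's disease column of Table 7.1
hodgkins = [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
            795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415]

print("skewness:", round(skewness(hodgkins), 2))
print("kurtosis:", round(kurtosis(hodgkins), 2))
```

A positive skewness for the raw counts agrees with the long right-hand tail described for Figure 7.1(a).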

图7.5 216例原发性胆汁性肝硬化患者的血清白蛋白值,表示为(a)频数直方图;(b)累计频数分布;(c)正态图。
Figure 7.5 Serum albumin values of 216 patients with primary biliary cirrhosis expressed as (a) frequency histogram; (b) cumulative frequency distribution; (c) Normal plot.

需要评估一组数据正态性的情形在后续章节中会出现。对于许多目的来说,仅凭目测检查正态图就足够了,但如果需要更深入的分析,则更有用的方法是测量正态图的直线程度。然后我们可以计算如果总体服从正态分布,样本中出现此类值的概率。如果该概率足够大,比如大于0.05(即1/20),我们就可以认为数据与正态分布相当接近。此过程是标准统计推断方法的一个例子,下一章将正式介绍并详细讨论。
Situations in which we may wish to assess the Normality of a set of data arise in subsequent chapters. For many purposes it is not necessary to do more than check the Normal plot by eye, but if something more is required then a more useful approach is based on measuring the straightness of the Normal plot. We can then calculate the probability that such a value would be obtained in a sample if the population had a Normal distribution, and if this probability is large enough, say greater than 0.05 (1 in 20), we conclude that the data are reasonably near to a Normal distribution. This procedure is an example of a standard statistical approach to inference which is introduced properly and discussed in detail in the next chapter.

Shapiro-Wilk W 正态性检验在多个统计软件中均可使用。然而,如果该检验不可用,可以相对容易地计算密切相关的 Shapiro-Francia 检验。但该检验直到第11.6节才会介绍,因为需要本章引入的分析方法。对于图7.5中的白蛋白数据,Shapiro-Wilk W检验得到的概率较大,为0.76;而图7.6中的胆红素数据概率非常小(见表7.2)。显然,血清白蛋白数据符合正态分布,而原始血清胆红素值则不符合。对数转换后的血清胆红素值的正态图(图7.6b)除少数低值外基本呈直线,但W检验显示数据仍与正态分布不符(见表7.2)。这说明在大样本中,该检验能够检测到少量非正态性,而这在大多数情况下并不重要。正如图4.9所示,log胆红素数据与正态分布非常相似。因此,在评估正态图和W检验结果时需要一定判断力。
The Shapiro-Wilk W test for Normality is available in several statistical computer programs. However, if it is unavailable the closely related Shapiro-Francia test can be calculated fairly easily. It is, however, not described until section 11.6 as it requires a method of analysis introduced in that chapter. For the albumin data shown in Figure 7.5 the Shapiro-Wilk W test yields a large probability of 0.76, while the bilirubin data in Figure 7.6 yield a very small probability (Table 7.2). Clearly the serum albumin data are compatible with a Normal distribution, while the raw serum bilirubin values are not. The Normal plot of the log serum bilirubin values (Figure 7.6b) is straight except for a few values at the lower end, but the W test shows that the data are not at all compatible with a Normal distribution (Table 7.2). This illustrates the fact that in large samples the test is able to detect small amounts of non-Normality, which in most circumstances would be unimportant. As Figure 4.9 showed, the log bilirubin data are very similar to a Normal distribution. Thus some judgement is required in assessing the Normal plot and the W test.
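The idea of measuring the straightness of a Normal plot can be illustrated with the squared correlation between the sorted observations and their Normal scores. This is a sketch in the spirit of that idea, not an implementation of the Shapiro-Wilk or Shapiro-Francia test: it gives no probability, and the formula used for the Normal scores (Blom's approximation) is an assumption.

```python
from math import log
from statistics import NormalDist, mean

def normal_scores(n):
    """Approximate Normal scores for ranks 1..n (Blom's formula, an assumption)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def straightness(values):
    """Squared correlation between sorted data and their Normal scores."""
    x = sorted(values)
    z = normal_scores(len(x))
    mx, mz = mean(x), mean(z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz ** 2 / (sxx * szz)

# Hodgkin's disease column of Table 7.1
hodgkins = [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
            795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415]

raw = straightness(hodgkins)
logged = straightness([log(v) for v in hodgkins])
print(f"raw counts: {raw:.3f}, log counts: {logged:.3f}")
```

Values close to 1 correspond to a nearly straight Normal plot. For these counts the log scale gives the straighter plot, in line with the histograms in Figure 7.1.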

表7.2 Shapiro和Wilk的W检验应用于216个血清白蛋白、血清胆红素及其对数值(来源:Christensen等,1985年)
Table 7.2 Shapiro and Wilk's W test applied to 216 values of serum albumin, serum bilirubin and log serum bilirubin (from the study by Christensen et al., 1985)

| 变量 Variable | W | 概率 Probability (P) |
|---|---|---|
| 血清白蛋白 Serum albumin | 0.986 | 0.76 |
| 血清胆红素 Serum bilirubin | 0.668 | < 0.0001 |
| 对数血清胆红素 Log serum bilirubin | 0.956 | < 0.0001 |

非正态性通常在分布的尾部最为明显。异常值在正态图上表现为一个或多个点偏离其余数据的线性趋势。即使只有一个异常值,也可能导致数据未通过Shapiro-Wilk检验。系统性曲线,如图7.6(a)所示,表明分布偏右(偏斜);而S形曲线则表示分布两端的值过多或过少,相较于正态分布,如图7.7和7.8所示。
Non-Normality is usually most marked in the tails of the distribution. Outliers will show up in a Normal plot as one or more points lying away from the general linear trend of the rest of the data. Even one outlier can make the data fail the Shapiro-Wilk test. Systematic curvature, as seen in Figure 7.6(a), indicates skewness (to the right), while an S shaped plot will indicate either too many or too few values in both tails of the distribution in comparison with a Normal distribution, as shown in Figures 7.7 and 7.8 respectively.

图7.8 分布尾部值过少的数据:(a)直方图;(b)正态图。
Figure 7.8 Data with too few values in the tails of the distribution: (a) histogram; (b) Normal plot.

正态图还可以揭示数据中两种分布的混合。图7.9展示了一窝猪仔出生体重的正态图,显示一组正常生长的猪仔和三只体重较轻的“弱仔”猪仔(Royston等,1982)。不同的斜率表明两组假定数据的标准差不同。
Normal plots can also reveal a mixture of two distributions in the data. Figure 7.9 shows a Normal plot of birth weights of one litter of piglets, suggesting a normally grown group and a group of three 'runt' piglets with lower weights (Royston et al., 1982). The different slopes indicate different standard deviations in the two putative groups.

7.5.4 构建正态图 7.5.4 Constructing a Normal plot

(本节较为技术性,略读不会影响连贯性。)
(This section is more technical and can be omitted without loss of continuity.)

图7.9 猪仔出生体重的正态概率图(Royston 等,1982)。
Figure 7.9 Normal plot of piglet birth weights (Royston et al., 1982).

图7.7等正态图中,轴的刻度是以观测值标准差的倍数线性排列。构建正态图时,先将观测值按升序排列,然后将数据点绘制在对应的正态分数上。正态分数是指在来自正态分布的给定大小的样本中,给定秩次的观测值预期低于或高于均值的标准差数。许多统计软件能计算正态分数并绘制正态图,有些甚至能轻松生成。手工绘制正态图时,可使用特殊的正态概率纸,其刻度对应正态分布的百分位数。先对观测值排序,然后将第 i 个观测值与对应百分比的正态分数作图,计算公式为
The scale of the axis in the Normal plots such as Figure 7.7 is linear in multiples of the standard deviation of the observations. The Normal plot is constructed by sorting the observations into ascending order and then plotting the data against the corresponding Normal scores. The Normal score is the number of standard deviations below or above the mean at which we expect to find the observation with a given rank in a sample of a given size from a Normal distribution. Many statistical programs can calculate Normal scores for plotting against the data, and some can produce Normal plots easily. For drawing a Normal plot by hand there is special Normal probability paper with divisions corresponding to the percentage points of the Normal distribution. The observations are sorted and then the ith observation is plotted against the Normal score corresponding to the percentage, given by
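A minimal sketch of this construction follows. It assumes the plotting position 100(i - 3/8)/(n + 1/4), often attributed to Blom; this is one common convention and is not necessarily the formula intended in the text. The function name is hypothetical.

```python
from statistics import NormalDist

def normal_plot_points(values):
    """Return (observation, Normal score) pairs ready for plotting."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, observation in enumerate(ordered, start=1):
        percentage = 100 * (i - 0.375) / (n + 0.25)   # assumed plotting position
        score = NormalDist().inv_cdf(percentage / 100)
        points.append((observation, score))
    return points

# Non-Hodgkin's disease column of Table 7.1
non_hodgkins = [116, 151, 192, 208, 315, 375, 375, 377, 410, 426,
                440, 503, 675, 688, 700, 736, 752, 771, 979, 1252]

for observation, score in normal_plot_points(non_hodgkins):
    print(f"{observation:6d} {score:7.3f}")
```

Plotting the first column against the second gives the Normal plot; the scores are symmetric about zero because the plotting positions for ranks i and n + 1 - i add up to 100%.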

7.6 为什么要转换数据? 7.6 WHY TRANSFORM DATA?

7.6.1 转换为正态分布 7.6.1 Transforming to Normality

正如接下来几章将看到的,大多数用于分析连续数据的统计方法(参数方法)都包含了关于样本所抽取总体数据的假设。特别是,它们假设数据来自一个值服从正态分布的总体。因此,我们期望数据符合这一假设,这也是为什么需要第7.5节中描述的正态性检验。我们常发现对数据进行变换后,分布会更接近正态分布。其中最常见的是对数变换。第4.6节介绍了对数正态分布,即通过取对数可以转化为正态分布的分布。图7.6中的血清胆红素数据和图7.1中的细胞计数就是例子。
As will be seen in the next few chapters, most statistical methods (parametric methods) for analysing continuous data incorporate assumptions about the data in the population from which the sample was drawn. In particular they include an assumption that the data come from a population where the values are Normally distributed. Thus we expect the data to be consistent with that assumption, which is why we need the test of Normality described in section 7.5. We often find that a transformation of the data will yield a distribution that is much nearer to a Normal distribution. By far the most common is the logarithmic or log transformation. The Lognormal distribution was introduced in section 4.6, as the distribution that can be transformed to a Normal distribution by taking logs. The serum bilirubin data shown in Figure 7.6 are an example, as are the cell counts in Figure 7.1.

对某些方法而言,分布假设并非特别关键,尤其是在样本量较大时。然而,仍有其他原因希望数据接近正态分布。许多参数方法的另一个重要假设是不同观察组具有相同的标准差。非正态数据往往伴随标准差的变化,而通过数据变换可以更好地满足这两个要求。例如,表7.1中霍奇金病和非霍奇金病患者的数据标准差分别为566.4和293.0,相差较大,但对数计数的标准差更为接近,分别为0.708和0.632,且分布更接近正态(见图7.1)。如果几个观察组的标准差与均值的比值相近,通常对数变换效果较好。该比值只对原始数据有意义,且对于只有两组数据时参考价值有限。这些数据的比值分别为0.69和0.56,相当接近。
For some methods the distributional assumption is not too critical, especially if the sample size is large. There are other reasons, however, for wishing data to be near to a Normal distribution. Another important assumption of many parametric methods is that different groups of observations have the same standard deviations. It is often the case that variation in standard deviations accompanies non-Normal data, and both requirements can be met more closely after transforming the data. For example, the data in Table 7.1 for Hodgkin's and non-Hodgkin's disease patients have rather different standard deviations of 566.4 and 293.0, but the standard deviations of the log counts are much more similar, being 0.708 and 0.632, and the distributions are much nearer to Normal (Figure 7.1). The log transformation is likely to work well if the ratio of the standard deviation to the mean is similar among several groups of observations. This calculation has meaning only for the raw data, and may not be very helpful with just two groups. For these data the ratios are 0.69 and 0.56, which are reasonably similar.
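The ratios and the standard deviations on the log scale can be recomputed from Table 7.1 as a check. The sketch assumes natural logarithms and the sample standard deviation (divisor n - 1).

```python
from math import log
from statistics import mean, stdev

# Both columns of Table 7.1
groups = {
    "Hodgkin's": [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
                  795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415],
    "non-Hodgkin's": [116, 151, 192, 208, 315, 375, 375, 377, 410, 426,
                      440, 503, 675, 688, 700, 736, 752, 771, 979, 1252],
}

summary = {}
for name, counts in groups.items():
    sd_raw = stdev(counts)
    summary[name] = {
        "sd": sd_raw,
        "ratio": sd_raw / mean(counts),              # SD divided by mean
        "sd_log": stdev([log(c) for c in counts]),   # SD of natural logs
    }
    print(name, {key: round(value, 3) for key, value in summary[name].items()})
```

Similar ratios across groups suggest that the standard deviation grows roughly in proportion to the mean, which is the situation in which taking logs tends to equalize the spread.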

其他有时使用的变换包括平方根变换和倒数变换。图7.10展示了血清胆红素数据在不同变换前后的直方图。平方根变换(图7.10c)效果不如对数变换显著,通常用于变量为计数(频数)且预期服从泊松分布的情况。
Other transformations sometimes used are the square root and reciprocal transformations. Figure 7.10 shows histograms of the serum bilirubin data before and after different transformations. The square root transformation (Figure 7.10c) is less dramatic than taking logs. It is particularly used when the variable is a count (frequency) and thus would be expected to follow a Poisson distribution.

倒数变换(图7.10d)效果比对数变换更剧烈(注意它会颠倒观察值的顺序),当数据极度偏斜时可能有用。Gore(1982)描述了对肾移植患者血浆肌酐值使用倒数变换,以及对肿瘤大小测量使用平方根变换的情况。然而,这些变换使用不普遍,且只要对数变换能取得满意结果,通常优先采用对数变换(见第9.7节)。有时使用特定变换有强烈的逻辑理由,例如立方根变换适用于体积数据,某段距离行走时间的倒数则表示速度。
The reciprocal transformation (Figure 7.10d) has a much more drastic effect than taking logs (note that it reverses the order of the observations), and may be useful if the observed data have an extremely skewed distribution. The use of the reciprocal transformation for plasma creatinine values of kidney transplant patients and the square root transformation for tumour size measurements were described by Gore (1982). Their use is not common, however, and there are certain reasons for using the log transformation in preference to any other as long as it yields satisfactory results (see section 9.7). Sometimes there may be a strong logical reason for using a particular transformation. For example, the cube root may be appropriate for data that are volumes and the reciprocal of a recorded time to walk a certain distance will yield the speed.
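The three transformations can be compared on a few values, as in the sketch below. The measurements are hypothetical and chosen only to show how strongly each transformation pulls in a large value, and that the reciprocal reverses the ordering.

```python
from math import log, sqrt

values = [2.0, 8.0, 50.0]   # hypothetical positive measurements, in ascending order

transformed = {
    "square root": [sqrt(v) for v in values],
    "log": [log(v) for v in values],
    "reciprocal": [1 / v for v in values],
}

for name, result in transformed.items():
    print(f"{name:12s}", [round(r, 3) for r in result])
```

All three require positive data in this form; zeros would need separate handling before taking logs or reciprocals.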

变换为正态分布的另一个原因是减少异常值(即非典型值)对分析结果的影响,这一问题在图7.2中有所展示。Armitage和Berry(1987,第368页)对此总结道:“如果连续变量不过分偏离正态分布,通常更为方便。”当无法实现时,可以采用秩次(非参数)分析方法(后续章节介绍),但总体上这些方法不如参数方法理想。
Another reason for transforming to Normality is to reduce the influence of outlying (and thus atypical) values on the results of analysis, a problem illustrated in Figure 7.2. The overall picture has been well summarized by Armitage and Berry (1987, p. 368): 'It is usually convenient if continuous variables do not depart too drastically from Normal'. When this cannot be achieved we can use rank (non-parametric) methods of analysis (described in subsequent chapters), but these are in general less satisfactory than parametric methods.

Transforming the data is sometimes felt to be a trick used by statisticians, a belief that is based on the idea that the natural scale of measurement is in some way sacrosanct. This is not really the case, and indeed some measurements, such as pH values and titres, are effectively already log transformed values. It is, however, always best to present results in the original scale of measurement. In later chapters I show how this is done.

7.6.2 比例的变换 7.6.2 Transforming proportions

The other main use of transformations is in the analysis of proportions. Observed proportions in the range 0.2 to 0.8 have similar uncertainty but very small or large proportions have smaller uncertainty as they are somewhat constrained towards the ends of the scale (zero and one). For statistical analyses we often wish to have equal uncertainty attached to all proportions, and we can achieve this by the logit transformation, which is defined by

$$\operatorname{logit}(p) = \log_e\left(\frac{p}{1-p}\right).$$

The logit transformation stretches out proportions in the same way as the percentiles of the Normal distribution are stretched out in the Normal plot, as Table 7.3 shows. The logit transformation is mainly used in regression analysis involving proportions, discussed in Chapter 12, and with the use of odds ratios to compare risks in different groups, described in Chapter 10.
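The logit of a proportion is the natural logarithm of the odds, p/(1 - p). A minimal Python sketch (the function name `logit` is simply a convenient label) evaluates it for the proportions listed in Table 7.3, so that the tabulated values can be checked:

```python
import math

def logit(p):
    """Logit transformation: natural log of the odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

for p in (0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99):
    print(f"{p:.2f}  {logit(p):6.2f}")
```

The transformation is symmetric about p = 0.5 and is undefined at exactly 0 or 1, which is why the function rejects those values.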

Table 7.3 Effect of logit transformation of a proportion $p$

| $p$ | logit($p$) |
|------|-------|
| 0.01 | -4.60 |
| 0.05 | -2.94 |
| 0.10 | -2.20 |
| 0.25 | -1.10 |
| 0.50 | 0.00 |
| 0.75 | 1.10 |
| 0.90 | 2.20 |
| 0.95 | 2.94 |
| 0.99 | 4.60 |

7.7 数据的其他特征 7.7 OTHER FEATURES OF THE DATA

本章前面几节讨论了分析前筛选数据时应关注的主要特征。本节考虑两个不那么明显但能为研究提供洞见的数据检查方面。
The previous sections of this chapter have discussed the main features to look for when screening data before analysis. This section considers two less obvious aspects of data examination that can shed light on a study.

7.7.1 数字偏好 7.7.1 Digit preference

When people measure something they may not do so accurately. The harder the quantity is to measure the greater will be the within-observer variability and also the possibility of subconscious biases. Digit preference is the name given to the way individuals can impose their personal (subconscious) prejudice on the way they record observations. We see digit preference in the final recorded digit of a measurement. For example, height is usually measured in whole centimetres, and blood pressure to the nearest 2 mm Hg. In a large series of observations we would expect to see equal numbers of height measurements with each terminating digit from 0 to 9, and equal numbers of blood pressure measurements ending in 0, 2, 4, 6 or 8. In practice we often see marked deviations from the expected Uniform distribution. Sometimes this is because the observer does not make the measurements to the precision specified in the study protocol, for example by rounding blood pressure more coarsely than the protocol requires. Often, however, the distribution varies from expected for no definable reason: the person simply seems to have a preference for numbers ending in, say, 3 or 7. The most common forms of digit preference lead to an excess of

  1. zeros
  2. zeros and fives
  3. even digits.

For (1) there will be a consequent shortage of ones and nines.

表7.4中的数据展示了这些特征,表中列出了一个病例对照研究中三组血压读数的末尾数字。病例组测量了两次,而对照组只测量了一次。三组数字中有两组显示出非常相似的模式,表明是同一人测量的。然而,第三组显示出不同的模式,说明测量者不同。(我后来向研究组织者确认了这一点。)注意,两位观察者都出现了0的过多现象,但他们记录血压的精度明显不同。
Several of these features can be seen in the data in Table 7.4, which shows terminal digits from three sets of blood pressure readings from a case-control study. The cases were measured twice while the controls were measured only once. Two of the three sets of digits show closely similar patterns, indicating that they were made by the same person. However, the third set shows a different pattern, showing that they must have been made by a different person. (I subsequently verified with the study organizer that this had happened.) Notice that both observers had an excess of zeros, but that they were clearly recording blood pressure to different accuracy.

Table 7.4 Final digits of recorded blood pressures in a case-control study

| Final digit | Cases, first exam | Cases, second exam | Controls |
|-------|----|----|-----|
| 0 | 71 | 23 | 23 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 15 | 17 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 18 | 14 |
| 5 | 21 | 1 | 9 |
| 6 | 0 | 10 | 9 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 24 | 28 |
| 9 | 0 | 0 | 2 |
| Total | 92 | 92 | 102 |

The case of blood pressure is particularly interesting. Blood pressure is a very difficult measurement to take as it involves listening for a change in sound while observing a rapidly falling column of mercury. Because digit preference was such a problem with blood pressure several special machines were designed to get round the problem. The best known is the 'random-zero sphygmomanometer', which incorporates a second, hidden column of mercury of random height which is adjusted before each measurement. The recorded blood pressure is then the sum of the heights of the observed column of mercury and the subsequently measured hidden column. However, even the use of this machine may not remove the strong effect of digit preference (Silman, 1985).

Another example of digit preference is seen in the albumin data in Figure 7.5. The steps in the second and third plots are due to many values having been recorded as a whole number (in g/l) rather than to one decimal place.

数字偏好的一个奇特特征是,即使你知道这一现象,它仍可能存在于你的测量中。数字偏好很少会对数据分析产生重要影响,但它是数据筛查的一个有用产物,可以帮助你了解测量是如何进行的。
A curious feature of digit preference is that even if you know about the phenomenon it is still likely to be present in your measurements. Digit preference will rarely have an important influence on the data analysis, but it is another useful product of data screening that you may see how the measurements were made.
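Screening for digit preference amounts to tallying the final recorded digit. The sketch below is one simple way to do this in Python; the function name `final_digit_counts` and the `readings` list are hypothetical and invented for illustration, not taken from Table 7.4.

```python
from collections import Counter

def final_digit_counts(readings):
    """Count how often each final digit (0-9) occurs among whole-number readings."""
    counts = Counter(abs(int(r)) % 10 for r in readings)
    return [counts.get(d, 0) for d in range(10)]

# Hypothetical blood pressure readings in mm Hg (illustration only).
readings = [120, 130, 125, 140, 118, 120, 135, 150, 122, 110, 130, 145]

counts = final_digit_counts(readings)
for digit, count in enumerate(counts):
    print(digit, count, "#" * count)  # crude text histogram of final digits
```

An excess of zeros, or of zeros and fives, shows up at once in such a tally. For measurements recorded to one decimal place the same idea applies after multiplying by 10.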

7.7.2 隐藏的时间效应 7.7.2 Hidden time effects

许多研究是在一段时间内进行的。通常隐含假设不同时间收集的数据是可比的,但情况并非总是如此。可能存在两种主要的隐藏时间效应。较为人知的是季节性或昼夜节律(24小时)变化。例如,许多疾病的发病率具有明显的季节性,许多激素水平呈现昼夜“节律”。这类效应广为人知,设计研究时避免相关问题并不困难。例如,建议在同一时间对同一受试者重复测量血压,因为血压具有明显的昼夜节律,早晨最高。关于此类数据的进一步讨论见第14.7节。
Many studies are carried out over a period of time. It is usually implicitly assumed that the data collected at different times are comparable, but this will not always be the case. Two main types of hidden time effect may exist. The better known effect is that of seasonal or circadian (24 hour) changes. For example, incidence rates of many diseases are strongly seasonal, and the levels of many hormones display a circadian 'rhythm'. Many effects of this nature are well known, and it is not difficult to design studies to avoid problems. For example, it is advisable to take repeat measurements of blood pressure from the same subject at the same time of day because blood pressure has a strong circadian rhythm, being highest in the morning. There is further discussion of this type of data in section 14.7.

另一种可能的隐藏时间效应较少被认识到。在受试者在数月或数年内招募的研究中,受试者特征或测量值可能发生变化。例如,在之前讨论的原发性胆汁性肝硬化研究中(Christensen等,1985),发现患者入组时的血清胆红素值在7年招募期间逐渐下降(Altman和Royston,1988)。血清胆红素是肝功能的良好指标,因此后期入组的患者比早期入组的患者病情较轻。由于这是一个随机试验,患者在整个期间随机接受硫唑嘌呤或安慰剂,患者特征的时间趋势并不重要。(但这说明了临床试验中使用同期对照的原因之一—见第15章。)
There is a second type of possible hidden time effect that is not widely recognized. In a study in which subjects are recruited over some months or years it is possible that there may be changes in the characteristics of the subjects or in the measurements made on them. For example, in the study of primary biliary cirrhosis previously discussed (Christensen et al., 1985) it was found that the serum bilirubin values of patients entering the trial steadily declined over the 7 years of patient recruitment (Altman and Royston, 1988). Serum bilirubin is a good indicator of liver function, so patients joining the study towards the end of the trial were rather less ill than those joining at the beginning. As this was a randomized trial, with patients given azathioprine or placebo at random throughout the period, the time trend in patient characteristics was not important. (It does, however, indicate one of the reasons for using concurrent controls in clinical trials; see Chapter 15.)

如果知道观察日期(我建议记录),则可以简单地将数据绘制成时间序列,观察是否存在趋势。Altman和Royston(1988)对此问题有更深入的讨论并给出其他实例。
If the date of observations is known (and I recommend that it is recorded) then it is simple to plot the data against time to see if there are any trends. Altman and Royston (1988) discuss this issue further and give other examples.
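Where a plot is not immediately to hand, a rough first check for a time trend is to tabulate the mean measurement by period of entry. The sketch below assumes the data are held as (year of entry, value) pairs; the function name `yearly_means` and the `records` list are hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def yearly_means(records):
    """Group (year, value) pairs by year and return {year: mean value}."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    return {year: mean(vals) for year, vals in sorted(by_year.items())}

# Hypothetical (year of entry, measurement) pairs showing a steady decline.
records = [(1, 52.0), (1, 48.0), (2, 45.0), (2, 43.0), (3, 40.0), (3, 36.0)]

for year, avg in yearly_means(records).items():
    print(year, round(avg, 1))
```

A steady rise or fall in these means across periods is a prompt to plot the individual values against date of entry and look more closely.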

7.8 结论性评论 7.8 CONCLUDING REMARKS

本章讨论了检查数据集一致性以及在可能的情况下准确性的方法,以及在分析前对数据进行筛选的步骤。这些程序对任何研究都很重要,尤其适用于大型数据集。没有计算机,这些步骤不太实用,但分析数据本身也需要计算机,因此先用计算机生成上述描述性统计表和图形是一个相对简单的延伸。唯一可能的例外是正态概率图(Normal plot),并非所有统计软件都能绘制。关于这些内容以及大型研究中质量控制的其他方面,Buyse(1984)有进一步讨论。
This chapter has dealt with ways of checking the consistency and, where possible, the accuracy of a set of data, and of screening the data prior to analysis. These procedures are important for any study, although perhaps particularly relevant to large data sets. They are not terribly practical without a computer, but a computer will also be needed to analyse the data, so it is a relatively simple extension to use the computer first to produce the descriptive tabulations and graphs described above. The possible exception is the Normal plot, which cannot be performed by all statistical programs. Further discussion of these matters, together with other aspects of quality control in large studies, is given by Buyse (1984).

为了清晰起见,本章将数据检查和筛选的各个方面分别讨论。然而在实际操作中,可以在一次分析中同时进行范围检查、异常值和缺失值的查找,以及数据分布形态的检验。
For clarity the various aspects of data checking and screening have been considered separately. In practice, however, it is possible to perform range checks, look for outliers and missing values, and examine the shape of the distribution of a set of data in a single analysis.

尽管这些方法不总被视为统计方法学的一部分,但它们是统计分析的必要环节,帮助你核实数据的正确性。前期花时间检查数据是非常值得的;如果数据错误直到主要分析阶段才被发现,就必须重新开始。数据筛选还能帮助你熟悉数据。这个想法有些抽象,但通过熟悉数据,你能更好地选择合适且有效的分析方法。
Although not always discussed as part of statistical methodology the methods described in this chapter are an essential part of statistical analysis, allowing you to check the correctness of your data. Time spent at the beginning checking the data is time well spent; errors in the data that are not detected until the main analysis is under way will require everything to be redone. Screening the data also allows you to get a feel for the data. This last idea is rather nebulous, but by familiarizing yourself with the data you should be much better equipped to choose appropriate and valid methods of analysis.

EXERCISES

7.1 The table below shows data from a study of 20 patients with chronic congestive heart failure (Caruana et al., 1988). Two measurements are shown - ejection fraction, which is a measure of left ventricular dysfunction, and pulmonary arterial wedge pressure:

| Patient | Ejection fraction (%) | Wedge pressure (mm Hg) |
|----|----|----|
| 1 | 28 | 15 |
| 2 | 26 | 14 |
| 3 | 42 | 15 |
| 4 | 29 | 12 |
| 5 | 16 | 37 |
| 6 | 21 | 30 |
| 7 | 25 | 7 |
| 8 | 35 | 14 |
| 9 | 30 | 28 |
| 10 | 36 | 13 |
| 11 | 37 | 5 |
| 12 | 41 | 13 |
| 13 | 20 | 24 |
| 14 | 26 | 8 |
| 15 | 38 | 13 |
| 16 | 26 | 17 |
| 17 | 10 | 27 |
| 18 | 18 | 29 |
| 19 | 10 | 8 |
| 20 | 31 | 5 |

One value has been mistranscribed from the paper. Which patient's data is most likely to be wrong?

【7】2 使用第7.5.4节中描述的方法,绘制表7.1第一列中20名霍奇金病患者的 细胞计数的正态概率图。
7.2 Use the method described in section 7.5.4 to construct a Normal plot of the cell counts for 20 Hodgkin's disease patients given in the first column of Table 7.1.

【7】3 评论练习3.1表格中三变量的末位数字。
7.3 Comment on the terminal digits of the three variables shown in the table in Exercise 3.1.

【7】4 调查以下血清孕酮数据(同为表14.13中第2组)末位数字是否存在数字偏好现象。
7.4 Investigate the possibility of digit preference in the final digits of the following serum progesterone data (also shown as Group 2 in Table 14.13).

| Time | Patient 1 | Patient 2 | Patient 3 | Patient 4 | Patient 5 | Patient 6 |
|-----|------|------|------|------|------|------|
| 0 | 1.0 | 1.0 | 1.0 | 3.0 | 8.3 | 6.2 |
| 1 | 1.5 | 1.0 | 1.0 | 2.5 | 7.5 | 5.9 |
| 3 | 5.0 | 6.5 | 7.3 | 2.0 | 9.6 | 6.8 |
| 5 | 11.0 | 20.0 | 7.5 | 2.7 | 11.0 | 7.7 |
| 10 | 16.0 | 22.5 | 18.0 | 3.4 | 11.5 | 9.0 |
| 15 | 23.0 | 27.8 | 20.0 | 3.6 | 15.7 | 9.3 |
| 30 | 15.0 | 19.0 | 18.9 | 14.0 | 15.2 | 12.1 |
| 45 | 9.0 | 9.0 | 12.8 | 7.3 | 15.8 | 12.2 |
| 60 | 6.0 | 8.2 | 6.3 | 7.7 | 14.0 | 11.0 |
| 120 | 5.0 | 8.0 | 4.8 | 4.7 | 11.5 | 9.0 |

8 统计分析原理 8 Principles of statistical analysis

统计学的一个独特功能是:它使科学家能够对其结论的不确定性进行数值评估。
A distinctive function of statistics is this: it enables the scientist to make a numerical evaluation of the uncertainty of his conclusion.

Snedecor(1950)
Snedecor (1950)

8.1 引言 8.1 INTRODUCTION

当我们为研究目的分析医学数据时,目的是将从一组个体样本中获得的发现推广到所有类似个体的总体中。我们在动物和实验室研究以及许多流行病学研究中最能清楚地看到这一点,这些数据无法与具体个体对应,但这同样适用于病例对照研究、临床试验,乃至临床研究整体。虽然从临床角度我们也可能关注每个个体,但研究通常旨在总结许多个体的经验以得出一般结论。因此,统计学的主要理念之一是—统计分析的目标是利用从样本个体获得的信息,对相关总体进行推断。
When we analyse medical data for research purposes the intention is to extrapolate the findings from a sample of individuals to the population of all similar individuals. We see this most clearly in animal and laboratory studies as well as in much epidemiological research, where the data cannot be identified with individual subjects, but it applies equally to case-control studies, clinical trials and indeed to clinical research in general. While we may also be interested in each individual from a clinical point of view, research is usually aimed at summarizing the experience of many individuals to draw general conclusions. Thus one of the main ideas of statistics is this - the aim of statistical analysis is to use the information gained from a sample of individuals to make inferences about the relevant population.

在大多数研究中,会收集一些数据用于描述性目的,例如关于被研究对象的人口统计学和临床特征的信息。数据分析的第一步是描述这些基本数据,第三章中介绍了用于此目的的简单描述方法。在观察性研究中,大多数甚至全部数据都属于此类。干预性研究,包括临床试验和实验室实验,明确是不同观察组之间的比较。我们如何比较这些数据集,尤其是在希望推广研究结果的情况下?
In most research studies some data are collected for descriptive purposes, for example information about the demographic and clinical characteristics of subjects being studied. The first step in the analysis of a set of data is to describe such basic data, and simple descriptive methods for this purpose were described in Chapter 3. In observational studies most if not all the data will be of this type. Intervention studies, which include clinical trials and laboratory experiments, are explicitly comparisons between different sets of observations. How do we compare sets of data, especially in view of the desire to generalize the findings?

The following seven chapters describe a large number of statistical methods for analysing data of various types for different research designs. The majority of the problems considered involve making comparisons between groups of observations of the same type or relating different observations within one group of individuals. Despite the enormous variety of medical problems and statistical solutions there are two basic approaches to statistical analysis that run through all of these methods - estimation and hypothesis testing. The next sections will discuss the principles behind each of these methods, and then they will be compared. The ideas in this chapter are fundamental to an appreciation of statistical thinking and thus to an understanding of the subsequent chapters.

8.2 抽样分布 8.2 SAMPLING DISTRIBUTIONS

本章最重要的概念已在4.3节介绍,即我们利用样本中获得的结果作为对相关总体真实情况的最佳估计。例如,如果我们发现一种新的银屑病治疗比标准治疗更能缓解患者症状,或男性血清胆固醇高于女性,或某种温度与光照组合能优化实验室细胞生长,那么我们期望这些结论在总体中同样成立。要使这种解释有效,样本必须具有代表性。本章介绍的方法展示了如何量化证据的强度或其不确定性。
The most important idea, already introduced in section 4.3, is that we take the results obtained in the sample and use them as our best estimate of what is true for the relevant population. So, for example, if we find that a new treatment for psoriasis relieves the symptoms of patients more often than a standard treatment, or that serum cholesterol is higher in men than women, or that a certain combination of temperature and light optimizes cell growth in a laboratory experiment, then in each case we would expect that the same is likely to be true in the population. For this interpretation to be valid the sample must be representative of the population. The methods described in this chapter show how to quantify the strength of the evidence, or its uncertainty.

正如我们在第4章中看到的,来自正态分布的小随机样本的分布可能完全不像正态分布。同样,随机样本的均值可能仅因偶然因素而与总体均值不同,尽管我们自然期望样本均值与总体均值相当接近。我们使用样本均值作为总体均值的估计值,因为这是我们拥有的最佳信息,但单个样本的均值作为总体值估计的准确性如何?我们需要一种方法来评估估计的不确定性。解决这个问题的一种方法是假设我们可以从总体中抽取许多相同大小的样本。我们能说这些样本均值相对于总体(即真实)均值的变异性如何吗?
As we saw in Chapter 4, small random samples from a Normal distribution may have a distribution that is not at all like a Normal distribution. Similarly, the mean of a random sample may differ from the population mean, just by chance, although naturally we expect the sample mean to be quite close to the population mean. We use the sample mean as an estimate of the population mean, because that is the best information we have, but how good is the mean of a single sample as an estimate of the population value? We need a way of assessing the uncertainty associated with our estimate. One way to approach this problem is to suppose that we could take many samples of a given size from the population. What can we say about the variability of the means of these samples in relation to the population (i.e. true) mean?

在第3章中,标准差被引入作为一组观测值围绕其均值的变异性的度量。测量假设样本均值围绕真实均值的变异性显然是一个类似的问题。事实证明,我们可以对多个样本均值的性质做出一些令人惊讶的强有力的陈述,并且可以利用这些信息来回答上述问题,即当我们仅取一个样本时,关于不确定性我们能说些什么。
In Chapter 3 the standard deviation was introduced as a measure of the variability of a set of observations around their mean. Measuring the variability of hypothetical sample means about the true mean is clearly a similar problem. It turns out that we can make some surprisingly strong statements about the properties of the means of several samples, and that we can use this information to answer the question posed above, namely what we can say about uncertainty when we have taken only one sample.

It is intuitively reasonable that the variability of sample means will have the following properties:

  1. it will be less among the means of large samples than small samples;
  2. it will be less than the variability of the individual observations in the population;
  3. it will increase with greater variability (standard deviation) among the individual values in the population.

以上这些确实都成立。可以用数学方法证明,随机样本均值的分布具有以下性质:
All of these are indeed true. It can be shown mathematically that the distribution of the means of random samples has the following properties:

(i) 样本均值分布的期望值等于总体均值。换句话说,样本均值的平均值就是总体均值。此外,样本方差的期望值是总体方差。
(i) The expected value of the mean of the distribution of the sample means is the same as the population mean. In other words, on average the mean of a sample will be the mean of the population. Further, the expected value of the variance of a sample is the variance of the population.

(ii) The expected value of the standard deviation of the means of several samples is $\sigma/\sqrt{n}$, where $\sigma$ is the standard deviation of the variable in the population and $n$ is the size of each sample. The quantity $\sigma/\sqrt{n}$ is known as the standard error of the mean, to distinguish it from the standard deviation of the observations. We can estimate the standard error from a single sample using the observed standard deviation in that sample, $s$, in place of $\sigma$. The interpretation and use of the standard error are discussed in section 8.4.

(iii) 如果总体数据的分布是正态分布,那么样本均值的分布也将是正态分布。更为重要且有些令人惊讶的是,只要样本足够大,无论总体变量的分布如何,样本均值的分布都将近似正态分布。这个重要结果被称为中心极限定理,它是许多主要统计方法的基础。有时我们关注的是一组值的和而非均值。两者仅在除以观测值数量上不同,因此中心极限定理同样适用于和与均值。
(iii) The distribution of the sample means will be Normal if the distribution of the data in the population is Normal. Further, and somewhat remarkably, the distribution of the sample means will be nearly Normal whatever the distribution of the variable in the population as long as the samples are large enough. This important result is known as the central limit theorem. It underlies many of the main statistical methods. Sometimes we will be concerned with the sum of a set of values rather than the mean. The two differ only with respect to division by the number of observations, so the central limit theorem applies equally to sums and means.

实际上,当数据呈单峰且不特别偏斜的分布时,(iii) 中关于样本量限制的条件并不重要。相反,只要样本量足够大,样本均值的分布无论数据的分布如何都将趋于正态。一般来说,数据越接近正态,重复抽样中均值近似正态分布的假设就越合理。如果我们能够假设均值服从正态分布,那么就可以轻松应用基于正态分布的方法(第4章介绍)来表示样本均值作为总体均值估计的不确定性。我将在下一节中回到这个问题。
In practice, the sample size restriction in (iii) is not relevant when the data have a distribution that is unimodal and not particularly asymmetric. Conversely, if the sample size is large enough the distribution of means will be Normal regardless of the distribution of the data. In general, the more Normal the data, the more reasonable will be the assumption that the mean will itself be Normally distributed in repeated sampling. If we can assume a Normal distribution for the mean it is easy to use the methods based on the Normal distribution (introduced in Chapter 4) to indicate the uncertainty of a sample mean as an estimate of the population mean. I shall return to this problem in the next section.

The preceding discussion has related to the mean of a sample from a population, but statements (i) to (iii) also apply to a sample proportion. If we give the values 1 and 0 to indicate the presence or absence of the attribute of interest, for example having had one's tonsils removed, the proportion with the attribute in a sample is the number with the attribute divided by the sample size. In other words, the proportion with the attribute is the mean of the 1s and 0s in the sample, and so properties (i) to (iii) above apply. However, as the population values are certainly not Normal, being either 1 or 0, property (iii) will apply only to large samples. Another way of looking at proportions is that the number with an attribute (which is equal to the sample size times the proportion $p$) will follow a Binomial distribution. As I mentioned in Chapter 4, the Binomial distribution becomes more like a Normal distribution for larger samples. If the observed proportion is $p$ and the sample size is $n$, then it is in fact the magnitude of the product $np$ that determines the closeness to a Normal distribution.

8.3 样本均值分布的演示 8.3 A DEMONSTRATION OF THE DISTRIBUTION OF SAMPLE MEANS

通过从总体中抽取多个样本,观察均值或比例的分布,可以更直观地理解上述关于均值或比例分布的结论。由于难以找到合适的真实数据,我采用了第6章提到的计算机模拟技术来演示这一过程。
The truth of the above statements about the distribution of means or proportions estimated from several samples can best be appreciated by seeing what actually happens when many samples are taken from a population. It is not easy to find appropriate real data, so to demonstrate what happens I have used computer simulation, a technique mentioned in Chapter 6.

First I shall consider the case where the distribution in the population is Normal. From (i) and (ii) in the previous section we expect that the means of a set of random samples will also have a Normal distribution, and we expect the standard deviation of all the sample means to be the population standard deviation divided by $\sqrt{n}$. As usual, by 'expect' we mean that this will happen on average - a set of several samples is still subject to sampling variation.

I used the study of patients with primary biliary cirrhosis (PBC) discussed in Chapter 4 as the basis for the simulations. I supposed that among all patients with PBC, which is the population of interest here, serum albumin values have a Normal distribution with a specified mean and a standard deviation of 6 g/l. I used computer simulation to study the distributions of samples of sizes 10, 25 and 100 drawn at random from this Normal distribution. Figure 8.1 shows the theoretical Normal distribution of serum albumin in the population of patients with PBC together with histograms of the means of 100 random samples of sizes 10, 25, and 100. (Note that as there were 100 samples the histograms show both frequencies and relative frequencies.) The expected standard deviations of the sets of 100 means are $6/\sqrt{10}$, $6/\sqrt{25}$ and $6/\sqrt{100}$ respectively, or 1.90, 1.20 and 0.60. It can be seen that the observed distributions are reasonably Normal, especially for larger samples, and that their means and standard deviations are close to the expected values. The histograms will get nearer to a Normal distribution as the number of means increases.
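A simulation of this kind can be sketched in Python as follows. This is not the program used for Figure 8.1. The population standard deviation of 6 g/l is the value implied by the expected standard deviations 1.90, 1.20 and 0.60 quoted in the text, while the population mean of 35 g/l, the seed and the function name `simulate_sample_means` are assumptions made purely for illustration.

```python
import math
import random
import statistics

def simulate_sample_means(mu, sigma, n, n_samples, rng):
    """Draw n_samples samples of size n from Normal(mu, sigma); return their means."""
    return [statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(n_samples)]

MU = 35.0     # ASSUMED population mean in g/l (hypothetical value)
SIGMA = 6.0   # population SD in g/l, implied by 1.90, 1.20 and 0.60 in the text

rng = random.Random(1)  # fixed seed so that the run is repeatable
for n in (10, 25, 100):
    means = simulate_sample_means(MU, SIGMA, n, 100, rng)
    print(n,
          "expected SD of means:", round(SIGMA / math.sqrt(n), 2),
          "observed SD of means:", round(statistics.stdev(means), 2),
          "mean of means:", round(statistics.mean(means), 2))
```

With only 100 samples per size the observed standard deviations will differ somewhat from the expected values, since a set of samples is itself subject to sampling variation.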

前节性质(iii)指出,当样本足够大时,即使总体分布非正态,也应观察到类似现象。我们利用PBC试验中的血清胆红素数据进行模拟研究。实际胆红素值分布高度偏斜,均值为,标准差为,但其对数值近似正态分布,均值为3.55,标准差为1.03。假设所有PBC患者的血清胆红素对数值服从均值为、标准差为的正态分布。图8.2展示了相应的原始血清胆红素对数正态分布及从该明显偏斜分布中随机抽取样本量为10、25和100的样本的结果。可以看到,样本均值的分布随着样本量增加而更趋近于正态,但即使样本量为100,均值分布仍略显偏斜。总体值偏斜程度越大,均值近似正态分布所需的样本量越大。
Property (iii) in the previous section stated that for samples large enough we should observe a similar phenomenon even when the population values do not have a Normal distribution. We can study this effect using simulation based on the serum bilirubin data in the PBC trial. The actual bilirubin values had a highly skewed distribution with a mean of and a standard deviation of , but log serum bilirubin had an approximately Normal distribution, with a mean of 3.55 and standard deviation 1.03. I supposed that in the population of all PBC patients log serum bilirubin has a Normal distribution with a mean of and a standard deviation of . Figure 8.2 shows the corresponding Lognormal distribution of raw serum bilirubin values and the results of taking random samples of size 10, 25 and 100 from this markedly skewed distribution. We can see that the distribution of the sample means becomes more nearly Normal as the size of the sample increases, but even for samples of 100 the distribution of means is still slightly asymmetric. The more skewed the population values the larger the sample size needed for the means to have a near Normal distribution.
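The slow approach to Normality for skewed data can be explored with a similar sketch. It is not the program used for Figure 8.2. Several things here are assumptions made for illustration: the log-scale parameters are the observed values 3.55 and 1.03 quoted above, 1000 samples are drawn per sample size rather than 100 so that the skewness estimates are steadier, and the function names are hypothetical.

```python
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation divided by SD cubed."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def lognormal_sample_means(log_mu, log_sigma, n, n_samples, rng):
    """Means of n_samples samples of size n drawn from a Lognormal distribution."""
    return [statistics.mean([rng.lognormvariate(log_mu, log_sigma) for _ in range(n)])
            for _ in range(n_samples)]

LOG_MU, LOG_SIGMA = 3.55, 1.03  # observed log-scale mean and SD, used as an ASSUMPTION

rng = random.Random(2)  # fixed seed so that the run is repeatable
for n in (10, 25, 100):
    means = lognormal_sample_means(LOG_MU, LOG_SIGMA, n, 1000, rng)
    print(n, "skewness of sample means:", round(skewness(means), 2))
```

A skewness near zero indicates a roughly symmetric distribution of means; positive values indicate that the long right-hand tail of the raw data still shows through.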

We can study the behaviour of observed proportions in a similar way. On the basis of general practitioner consultations it seems that the prevalence of asthma among women in England is about 0.20 (i.e. 20%) (Fleming and Crombie, 1987). We would expect that the observed proportions of asthma sufferers in a series of random samples of English women would tend to have a Normal distribution as the sample size is increased.

如第4章所述,样本中具有某属性的个体数服从二项分布。观察比例可视为均值,因此在重复的大样本中,样本比例的分布近似正态。我利用计算机模拟研究了当总体比例为0.2时样本比例的变异。图8.3显示了100个随机样本(样本量分别为10、25和100)中哮喘女性比例的分布。显然,随着样本量增加,分布确实更接近正态分布。二项分布趋近正态分布的速度取决于比例和样本量。比例越接近0或1,二项分布即使在较大样本下也越偏斜。
As discussed in Chapter 4, the number of subjects in a sample who have a particular attribute follows a Binomial distribution. The observed proportion can be considered to be a mean, and thus in repeated large samples we expect the distribution of the sample proportions to be approximately Normal. I used computer simulation to study the variation in the sample proportion when the population proportion is 0.2. Figure 8.3 shows the resulting distributions of the proportion of women suffering from asthma in 100 random samples of size 10, 25 and 100. It is clear that the distribution does indeed become more like a Normal distribution as the sample size increases. The speed with which the Binomial distribution resembles a Normal distribution depends upon the proportion and sample size. The nearer the proportion is to 0 or 1 the more asymmetric is the Binomial distribution even for quite large samples.
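The behaviour of sample proportions can be simulated in the same spirit. The sketch below is not the program used for Figure 8.3; the seed and the function name `simulate_sample_proportions` are assumptions. The expected standard deviation uses the standard result that a 0/1 variable with proportion p has standard deviation √(p(1 − p)), so that by property (ii) the proportions in samples of size n have standard deviation √(p(1 − p)/n).

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, n_samples, rng):
    """Proportion with the attribute in each of n_samples samples of size n."""
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(n_samples)]

P = 0.20  # population proportion quoted in the text

rng = random.Random(3)  # fixed seed so that the run is repeatable
for n in (10, 25, 100):
    props = simulate_sample_proportions(P, n, 100, rng)
    expected_sd = math.sqrt(P * (1 - P) / n)
    print(n,
          "expected SD:", round(expected_sd, 3),
          "observed SD:", round(statistics.stdev(props), 3),
          "mean proportion:", round(statistics.mean(props), 3))
```

With samples of size 10 only eleven distinct proportions are possible (0, 0.1, ..., 1), which is one reason the distribution looks far from Normal for small samples.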

这些模拟从经验上验证了上一节中的三个陈述。实际上,我们几乎总是只有一个样本,但因为可以预测如果取多个样本会发生什么,所以我们可以利用单个样本的数值对总体做出有力推断,并量化不确定性。
These simulations have verified empirically the three statements in the previous section. In practice we nearly always have just a single sample, but because we can predict what would happen if many samples were taken we can use values from a single sample to make some strong inferences about the population, and can quantify the uncertainty.


图8.3 显示了在100个随机样本中,样本容量分别为10、25和100时,患哮喘女性的观察分布(概率为0.20)。
Figure 8.3 Observed distributions of the number of women with asthma (probability 0.20) in 100 random samples of sizes 10, 25, and 100.

8.4 估计 8.4 ESTIMATION

我将首先考虑从一组样本中测量数据并希望对总体均值做出结论的情况,然后再考虑与总体中某一感兴趣比例相关的同一问题。
I shall first consider the case where we have taken measurements from a sample of people and wish to draw conclusions about the mean of the population, and then consider the same problem relating to a proportion of interest in the population.

8.4.1 样本均值的标准误 8.4.1 Standard error of a sample mean

Figure 4.5 showed that the distribution of the observed serum albumin values in 216 patients with PBC was close to a Normal distribution. The mean of these values was 34.46 g/l and the standard deviation was 5.84 g/l. What can we infer about serum albumin values in the population of all patients with PBC from this single sample? Clearly any inference must depend on our sample being representative of the population, and I shall make this assumption for all the examples in this section. From section 8.2 our best estimates of the mean and standard deviation in the population are also 34.46 and 5.84 g/l.

In the previous section I stated that the standard deviation of many sample means will be $\sigma/\sqrt{n}$, where $\sigma$ is the standard deviation in the population and $n$ is the sample size, and this was demonstrated by simulation. The standard deviation of sample means is a hypothetical quantity, because in practice we take only a single sample, so we give it the different name of the standard error of the mean (SEM). Although there are other types of standard error associated with other estimates, the standard error of the mean is often abbreviated to standard error (SE) as it is not usually ambiguous to do so. The name standard error gives an indication of the interpretation, because we want to quantify how good our sample mean is as an estimate of the true, and unknown, population mean - how large an error might we be making?

The standard error of the sample mean serum albumin is thus $5.84/\sqrt{216} = 0.397$ g/l. We would expect the means of repeated samples of the same size to have a Normal distribution with mean 34.46 g/l and standard deviation 0.397 g/l. Note that the standard error is not an estimate of any quantity in the population, but an indication of the variability among many sample means or, alternatively, a measure of the uncertainty of a single sample mean as an estimate of the population mean. The uncertainty decreases as the sample size increases, as is apparent from the formula and was demonstrated in Figure 8.1. In section 8.4.5 I shall show how to use the standard error to construct a confidence interval. The standard error itself, although widely quoted, is a less useful quantity.
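The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `standard_error_of_mean` is my own, and the sample standard deviation is used in place of the unknown population value, as in the text.

```python
import math


def standard_error_of_mean(sd: float, n: int) -> float:
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


# Serum albumin example from the text: SD 5.84 g/l in 216 patients.
sem = standard_error_of_mean(5.84, 216)
print(f"SEM = {sem:.3f} g/l")

# Quadrupling the sample size halves the standard error.
sem_4n = standard_error_of_mean(5.84, 4 * 216)
print(f"SEM with four times as many patients = {sem_4n:.3f} g/l")
```

Because the sample size enters through its square root, halving the standard error needs four times as many subjects.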

8.4.2 两个样本均值差异的标准误 8.4.2 Standard error of the difference between two sample means

Most medical research is comparative, and so we are more often concerned with two or more samples rather than a single sample. Comparing two samples is particularly common, and for this we need to know the standard error of the difference between the means of two samples.

In a single sample of size $n$ from a population with a standard deviation of $\sigma$ the variance of the sampling distribution of the mean is $\sigma^2/n$, and so the standard error of the mean is $\sigma/\sqrt{n}$. If we have two independent samples the variance of the difference between their means is the sum of the separate variances, so the standard error of the difference in means is the square root of the sum of the separate variances. In mathematical notation, if the two means are $\bar{x}_1$ and $\bar{x}_2$, from samples of sizes $n_1$ and $n_2$ with standard deviations $\sigma_1$ and $\sigma_2$, then

$$\mathrm{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\mathrm{SE}(\bar{x}_1)^2 + \mathrm{SE}(\bar{x}_2)^2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$

For example, a large study of acute myocardial function reported the mean blood urea nitrogen separately for 1551 men (SD 13) and 538 women (SD 15) (Dittrich et al., 1988). The standard error of the difference between the two means is $\sqrt{13^2/1551 + 15^2/538} = 0.73$, in the same units as the measurements.
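As a check on the arithmetic, the standard error of the difference can be computed directly from the two standard deviations and sample sizes quoted from Dittrich et al. (1988). The function name below is illustrative, and using the sample standard deviations as if they were the population values is the large-sample assumption made in the text.

```python
import math


def se_difference_of_means(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Square root of the sum of the two sampling variances of the means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)


# Men: SD 13, n = 1551; women: SD 15, n = 538.
se_diff = se_difference_of_means(13, 1551, 15, 538)
print(f"SE of the difference in means = {se_diff:.2f}")
```

The smaller group dominates: most of the combined variance comes from the 538 women.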

标准误可用于构建两个独立样本连续变量均值差的置信区间,前提是样本量较大(参见第8.4.5节)。对于小样本,将采用稍有不同的方法,详见第9章。
The standard error can be used to construct a confidence interval for the difference in the means of two independent samples of values of a continuous variable if the samples are large (see section 8.4.5). For small samples a slightly different approach is used, as will be described in Chapter 9.

8.4.3 样本比例的标准误 8.4.3 Standard error of a sample proportion

I showed that a sample proportion will have an approximately Normal distribution in large samples. We can thus make an approximation by calculating the standard error of a sample proportion under the assumption that the sample size is large enough. As we have seen, for proportions that are not close to 0 or 1 the approximation is quite good even for fairly small samples. It is reasonable to use this approximation when $np$ and $n(1-p)$ are greater than 5. For example, the approximation is good for proportions in the range 0.1 to 0.9 for samples greater than about 50, but for values of $p$ outside this range a larger sample is required.

The standard error of the Binomial proportion $p$ in a sample of size $n$ was given in Chapter 4 as $\sqrt{p(1-p)/n}$. Using the Normal approximation we thus expect that if the population proportion is $p$ then in repeated samples of the same size the observed proportions will have a Normal distribution with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$. Returning to the earlier example, if we observe that 13 of a random sample of 80 women have asthma, then from that sample we would estimate that the proportion of women in the population with asthma is $13/80 = 0.16$, with a standard error of $\sqrt{0.16 \times 0.84/80}$.

8.4.4 两个比例差的标准误 8.4.4 Standard error of the difference between two proportions

We can calculate the standard error of the difference between two proportions in the same manner as that of the difference between two means given in section 8.4.2. If we have two observed proportions, $p_1$ and $p_2$, from two independent samples of sizes $n_1$ and $n_2$, then the standard error of their difference, $p_1 - p_2$, is given by

$$\mathrm{SE}(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}.$$

For example, in a large study of adolescents 165 of 712 boys reported that they always used a seat belt compared with 91 of 641 girls (Maron et al., 1986). The two proportions are 0.232 and 0.142, so the difference in proportions is 0.090. The standard error of the difference is

$$\sqrt{\frac{0.232 \times 0.768}{712} + \frac{0.142 \times 0.858}{641}} = 0.0210.$$
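The seat belt example can be reproduced with a short sketch. The helper names are my own, and the code works from the raw counts rather than the rounded proportions, which makes no difference at the precision quoted.

```python
import math


def se_proportion(p: float, n: int) -> float:
    """Normal-approximation standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


def se_difference_of_proportions(p1: float, n1: int, p2: float, n2: int) -> float:
    """Square root of the sum of the two separate sampling variances."""
    return math.sqrt(se_proportion(p1, n1) ** 2 + se_proportion(p2, n2) ** 2)


# Maron et al. (1986): 165 of 712 boys and 91 of 641 girls always used a seat belt.
p_boys, p_girls = 165 / 712, 91 / 641
diff = p_boys - p_girls
se_diff = se_difference_of_proportions(p_boys, 712, p_girls, 641)
print(f"difference = {diff:.3f}, SE = {se_diff:.4f}")
```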

8.4.5 置信区间 8.4.5 Confidence intervals

我在第8.2节中指出,样本中观察到的均值或比例是总体“真实”值的最佳估计,并且对于大样本,从多个样本获得的值的分布大致呈正态分布。我们可以将样本估计的这些特性与正态分布的已知性质结合起来,了解单一样本估计总体值时的不确定性。我们通过构建置信区间来实现这一点,置信区间是一个我们有信心包含真实值的数值范围。基本思想是,置信区间覆盖了感兴趣统计量的抽样分布的大部分。
I observed in section 8.2 that the mean or proportion observed in a sample is the best estimate of the 'true' value in the population, and that the distribution of the values obtained in several samples would be approximately Normal for large samples. We can combine these features of estimates from a sample with the known properties of the Normal distribution to get an idea of the uncertainty associated with a single sample estimate of the population value. We do this by constructing a confidence interval, which is a range of values which we can be confident includes the true value. The basic idea is that the confidence interval covers a large proportion of the sampling distribution of the statistic of interest.

A confidence interval for the estimated mean extends either side of the mean by a multiple of the standard error. For example, the interval between mean - 3SE and mean + 3SE will be a 99.7% confidence interval, because the probability of getting a value from a Normal distribution three or more standard deviations from the mean is 0.003 (as shown in section 4.5 and Table B2). It is most common to calculate a 95% confidence interval, which is the range of values from mean - 1.96SE to mean + 1.96SE. However, there is no particular reason for choosing 95% other than convention, and levels of 80%, 90% and 99% are sometimes used.
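The link between the confidence level and the multiple of the standard error can be sketched with the standard library's `statistics.NormalDist`. The function name `se_multiplier` is illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def se_multiplier(level: float) -> float:
    """Multiplier z such that estimate +/- z * SE covers `level` of a Normal distribution."""
    return std_normal.inv_cdf(0.5 + level / 2)


for level in (0.80, 0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: estimate +/- {se_multiplier(level):.2f} SE")

# Coverage of the interval from mean - 3SE to mean + 3SE.
coverage_3se = std_normal.cdf(3) - std_normal.cdf(-3)
print(f"coverage of +/- 3 SE = {coverage_3se:.4f}")
```

Higher confidence levels need larger multipliers, which is why a 99% interval is wider than a 95% interval from the same data.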

We expect that the 95% confidence interval will not include the true population value 5% of the time. We can improve the probability of including the population mean by using, say, a 99% confidence interval, but at the cost of having a wider interval and thus greater uncertainty. The important point is that there is a small chance that the confidence interval constructed from a single sample will not include the true population mean, whatever the sample size.

The 95% confidence interval for the sample mean is usually interpreted as a range of values which contains the true population mean with probability 0.95. We thus expect that if we calculate a 95% confidence interval for the mean serum albumin using each of the 100 random samples shown in Figure 8.1 we would find that about 5% of them would not include the population mean of 35.0 g/l. Figure 8.4 shows all 100 confidence intervals based on samples of size 100, of which seven do not include 35.0 g/l. Figure 8.5 shows the confidence intervals sorted by the size of the sample mean, and we can see that seven sample means fall outside the range within which we expect 95% of sample means. This range is calculated using the population mean and standard deviation to get mean ± 1.96 SE; that is 35.0 ± 1.2, or 33.8 to 36.2. The difference between the observed 7% and the expected 5% is of no importance - we would not expect to observe exactly 5%.

Figure 8.4 Confidence intervals for mean serum albumin constructed from 100 random samples of size 100. The vertical lines show the range within which 95% of sample means are expected to fall.

Figure 8.5 Confidence intervals from Figure 8.4 ordered by the magnitude of the mean of the random sample.
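A simulation in the spirit of Figures 8.4 and 8.5 can be sketched as follows. This is not the simulation used for the figures: the population standard deviation of 6.0 g/l, the seed and the number of samples are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # arbitrary seed so that the run is repeatable

POP_MEAN, POP_SD = 35.0, 6.0          # assumed population values (illustrative)
N_SAMPLES, SAMPLE_SIZE = 1000, 100    # more samples than the 100 in the figures

misses = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    m = mean(sample)
    se = stdev(sample) / math.sqrt(SAMPLE_SIZE)
    lower, upper = m - 1.96 * se, m + 1.96 * se
    if not (lower <= POP_MEAN <= upper):
        misses += 1  # this interval does not include the population mean

miss_rate = misses / N_SAMPLES
print(f"{misses} of {N_SAMPLES} intervals ({miss_rate:.1%}) miss the population mean")
```

The proportion of intervals that miss should be close to 5%, but, as with the seven intervals in Figure 8.4, it will rarely be exactly 5%.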

In the PBC trial we actually observed a mean serum albumin of 34.46 g/l with a standard error of 0.397 g/l from a sample of 216 patients with primary biliary cirrhosis. The 95% confidence interval is thus given by the range of values from $34.46 - 1.96 \times 0.397$ to $34.46 + 1.96 \times 0.397$, or from 33.68 to 35.24 g/l. We can thus be 95% confident from this study that the true mean serum albumin among all such patients lies somewhere in the range 33.68 to 35.24 g/l, with 34.46 as our best estimate. As mentioned earlier, this interpretation depends on the assumption that the sample of 216 patients is representative of all patients with the disease.
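The interval for mean serum albumin follows the general recipe of estimate ± multiplier × SE, which can be sketched as a small helper. The name `normal_ci` is my own, and the function assumes the large-sample Normal approximation used in this section.

```python
from statistics import NormalDist


def normal_ci(estimate: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Large-sample confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


# Mean serum albumin in 216 PBC patients: 34.46 g/l, SE 0.397 g/l.
lower, upper = normal_ci(34.46, 0.397)
print(f"95% CI: {lower:.2f} to {upper:.2f} g/l")

# A 99% interval from the same data is wider.
lower99, upper99 = normal_ci(34.46, 0.397, level=0.99)
print(f"99% CI: {lower99:.2f} to {upper99:.2f} g/l")

# The same helper applies to the difference in seat belt use (0.090, SE 0.0210).
lower_diff, upper_diff = normal_ci(0.090, 0.0210)
print(f"95% CI for the difference: {lower_diff:.2f} to {upper_diff:.2f}")
```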

The same 216 patients with PBC had serum bilirubin values that had an approximately Lognormal distribution. We could calculate a confidence interval for the mean serum bilirubin by relying on the central limit theorem and using the same method as for serum albumin. However, because the distribution of serum bilirubin is highly skewed we would be more interested in the median than the mean. A more useful confidence interval would therefore be for the median, or we could calculate a confidence interval for the mean of the log serum bilirubin values and back-transform it to give a confidence interval for the geometric mean. These methods are described in the next chapter.

Similarly, we can construct a confidence interval for our sample of 80 women among whom the observed proportion with asthma was 0.16 with a standard error of 0.039. A 95% confidence interval for the proportion is from $0.16 - 1.96 \times 0.039$ to $0.16 + 1.96 \times 0.039$, or from 0.08 to 0.24. We are thus 95% confident that on the basis of this sample the proportion of English women with asthma lies in the range 0.08 to 0.24. The confidence interval is wide because the sample size of 80 is rather small for estimating a proportion. In contrast, a confidence interval for the difference in the proportions of boys and girls always using seat belts is narrower because the study was large. The difference in proportions was 0.090 and its standard error was 0.0210, so the 95% confidence interval is from $0.090 - 1.96 \times 0.0210$ to $0.090 + 1.96 \times 0.0210$, or from 0.05 to 0.13.

These examples illustrating the construction of confidence intervals have made use of Normal distribution theory applied to large samples. In later chapters we will use the $t$ distribution rather than the Normal distribution for analysis of continuous data, but use the Normal distribution for proportions. The general principle of constructing a confidence interval by adding to or subtracting from an estimate a multiple of its standard error applies in nearly all cases.

许多统计分析旨在估计一个或多个感兴趣的量。本章讨论了均值和比例,但相同的思想也适用于其他量的估计。计算感兴趣估计量的标准误后,即可得到置信区间。
Much statistical analysis aims to estimate one or more quantities of interest. Means and proportions have been discussed in this chapter, but the same ideas apply to estimates of other quantities. The standard error of the estimate of interest is calculated, from which one obtains a confidence interval.

8.5 假设检验 8.5 HYPOTHESIS TESTING

The approach outlined in the preceding sections seems so straightforward that it may come as some surprise that most statistical analysis in medicine is not of this form, but is based on a different and less intuitive approach called hypothesis testing. The majority of statistical analyses involve comparison, most obviously between treatments or procedures or between groups of subjects. The numerical value corresponding to the comparison of interest is often called the effect. We can state a hypothesis called the null hypothesis that the effect of interest is zero, for example that serum cholesterol is the same on average for men and women or that two treatments for headache are equally effective. This statistical null hypothesis is often the negation of the research hypothesis that generated the data. In the first example, the research hypothesis might be that there was a difference between men and women with respect to their serum cholesterol levels. We also have an alternative hypothesis, which is usually simply that the effect of interest is not zero. Having set up the null hypothesis, we then evaluate the probability that we could have obtained the observed data (or data that were more extreme) if the null hypothesis were true. This probability is usually called the P value; the smaller it is the more untenable is the null hypothesis. The method is called testing because of the aspect of deciding whether or not we can reject the null hypothesis. We might find, for example, that in a study comparing serum cholesterol levels of men and women, there was a tendency for higher levels in men, and the P value was 0.10. Notice that there is no direct reference in this method to the magnitude of the effect of interest: the analysis is summarized by a probability value. For this and other reasons the approach based on estimation and confidence intervals is widely considered superior, but hypothesis testing remains an important statistical method, and it is essential to understand the underlying principles and interpretation. The Shapiro-Wilk test for non-Normality, described in section 7.5.3, is an example of a hypothesis test.

How do we evaluate the probability of obtaining our data if the null hypothesis is true? For most of the problems discussed in this book the answer lies in calculating a test statistic - a value which we can compare with the known distribution of what we expect when the null hypothesis is true. The general form of the test statistic can be expressed in relation to the observed value of the quantity of interest and the value expected if the null hypothesis were true. The observed value is the estimate of interest, such as the difference in mean serum cholesterol between men and women. For the situations so far described the test statistic is given by

$$\text{test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error of observed value}}.$$

在许多情况下,假设值为零,因此检验统计量变为观察值与其标准误的比值。将感兴趣量的大小评估为其标准误的倍数是主要统计分析方法中的常见思路。然而,后续章节将讨论一些检验统计量不符合上述形式的情况。
In many cases the hypothesized value is zero, so that the test statistic becomes the ratio of the observed quantity of interest to its standard error. The idea that the magnitude of the quantity of interest is evaluated as a multiple of its standard error is common in the main methods of statistical analysis. However, there are several situations discussed in later chapters where the test statistic is not of the above form.

In some circumstances discussed in later chapters we will see that when the null hypothesis is true the test statistic can be considered to have a Normal distribution. In other cases, notably when studying means, we need to use the slightly different $t$ distribution, but the principle is the same.

We evaluate a test statistic by calculating the probability that we could have observed that value, or one that is more extreme (i.e. more unlikely), if the null hypothesis is true. The probability of interest, or P value, is thus the tail area of the distribution. As an example, I shall consider the case where the test statistic has a Normal distribution when the null hypothesis is true. Suppose we wish to use the sample of 216 PBC patients to evaluate the null hypothesis that the mean serum albumin in all PBC patients is 33.5 g/l. As shown earlier, the mean serum albumin in the sample was 34.46 g/l and its standard error was 0.397 g/l. This is a situation where we can use the formula given above, so we calculate the test statistic as (34.46 - 33.5)/0.397, which is 2.42. From Table B1 the tail area of the Normal distribution corresponding to this value of the test statistic is 0.0078. However, the test statistic could be negative, and the equivalent values in the other tail of the distribution are just as extreme, or unlikely, when the null hypothesis is true, so we double the area to get a P value of 0.0155. This value can be obtained directly from Table B2. In other words, a test statistic at least as far from zero as 2.42, in either direction, would arise with a probability of only 0.0155 if the null hypothesis is true. We call this a two-tailed test, for obvious reasons. The question of whether to use a two-tailed or a one-tailed test is discussed in section 8.5.6.

We can carry out a hypothesis test for all the situations described in section 8.4 where we can calculate a confidence interval, and this is true in general. In later chapters, however, we will see that there are some circumstances where we can perform a hypothesis test but cannot obtain a confidence interval.
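The serum albumin test can be sketched in Python, with the Normal tail area computed by `statistics.NormalDist` instead of being read from Tables B1 and B2. The function name `z_test` is illustrative. Because the code does not round the test statistic to 2.42 before finding the tail area, its P value can differ from the tabulated one in the fourth decimal place.

```python
from statistics import NormalDist


def z_test(observed: float, hypothesized: float, se: float) -> tuple[float, float]:
    """Return the test statistic and the two-tailed P value under a Normal distribution."""
    z = (observed - hypothesized) / se
    one_tail = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in one tail
    return z, 2 * one_tail                   # doubled to allow for both tails


# Null hypothesis: mean serum albumin in all PBC patients is 33.5 g/l.
z, p_two_tailed = z_test(observed=34.46, hypothesized=33.5, se=0.397)
print(f"test statistic = {z:.2f}, two-tailed P = {p_two_tailed:.4f}")
```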

8.5.1 P值的解释 8.5.1 Interpretation of P values

P 值在医学研究论文中随处可见,因此准确理解它们的含义以及它们不代表什么至关重要。P 值是在原假设为真时,观察到我们数据(或更极端的数据)的概率。例如,在临床试验中,这句话指的是治疗组之间观察到的差异。因此,我们将数据与在总体中原假设为真时样本因偶然产生的可能变异联系起来。
P values abound in medical research papers, so it is essential to understand precisely what they mean, and also what they do not mean. The P value is the probability of having observed our data (or more extreme data) when the null hypothesis is true. For example, in a clinical trial this statement refers to the observed difference between the treatment groups. We are therefore relating our data to the likely variation in a sample due to chance when the null hypothesis is true in the population.

我们已经看到,样本的结果与总体的真实情况存在差异,且样本间的变异性随着样本量的增加而减少。后续章节将展示这些事实在计算检验统计量及其对应的 P 值时被考虑在内。
We have seen that samples give results that differ from what is true in the population, and that the variability among samples decreases as the sample size increases. It will be seen in subsequent chapters that these facts are taken into account when test statistics, and hence P values, are calculated.

The interpretation of a P value is problematic. If we carry out a clinical trial to compare two treatments and get a 'large' value of P, say greater than 0.2, then we can say that data such as ours could occur often when the null hypothesis is really true. We thus cannot rule out the possibility that the null hypothesis is true - that is, that the two treatments are equally effective. Conversely if P is very small, say less than 0.001, then the null hypothesis appears implausible because our data could hardly ever arise purely by chance when the null hypothesis is true. We can therefore feel confident that the null hypothesis is not true and one treatment is superior. Between these two extremes lies a grey area, but conventionally a cut-off is chosen and if P is smaller than the cut-off value the null hypothesis is rejected. The test of the null hypothesis is therefore whether or not P lies below the chosen cut-off point.

Although the choice of cut-off is arbitrary, in practice in most cases we use 0.05. In other words, an outcome that could occur less than one time in 20 when the null hypothesis is true would lead to the rejection of the null hypothesis. In this formulation, when we reject the null hypothesis we accept a complementary alternative hypothesis, which in the clinical trial example is that the two treatments are not equally effective. If the P value exceeds the cut-off value we do not reject the null hypothesis. However, we cannot say that we believe the null hypothesis is true, but only that there is not enough evidence to reject it. This is a subtle but important distinction.

When P is below the cut-off, say 0.05, the result is called statistically significant (and below some lower level, such as 0.01, it may be called highly significant); when above 0.05 it is called not significant. For this reason hypothesis tests are often called significance tests. The use of the word significant leads to much confusion between statistical and clinical significance. Because of the widespread use of hypothesis tests some medical journals restrict the use of the word significant to its statistical meaning. However, it is common practice to take a statistically significant result as a real effect, and often, by implication, as a clinically important effect too. Neither interpretation is necessarily justified. For example, in the study to compare blood pressure in the left and right arms described in section 5.4, a small difference of about 1 mmHg (both systolic and diastolic) was found (Gould et al., 1985). This difference was highly statistically significant but of no importance clinically. Similarly it is not reasonable to take a non-significant result as indicating no effect, just because we cannot rule out the null hypothesis.

8.5.2 P 作为显著性水平 8.5.2 P as a significance level

The cut-off level for statistical significance is usually taken at 0.05, but sometimes at 0.01. These cut-offs are arbitrary and have no specific importance. It is ridiculous to interpret the results of a study differently according to whether the P value obtained was, say, 0.055 or 0.045. These P values should lead to very similar conclusions, not diametrically opposed ones. A minor change to the data can easily shift the P value by this amount or more.

In recent years there has been a welcome move away from regarding the P value as significant or not significant, according to which side of the arbitrary 0.05 value it is, towards quoting the actual P value. It is increasingly common to see expressions such as P = 0.02 or P = 0.15 rather than P < 0.05 or P > 0.05. One reason for this is that many statistical computer programs give the exact P value, whereas it used to be necessary to evaluate a P value from tables in which test statistics were given corresponding to certain P values only, such as 0.1, 0.05, 0.01 and 0.001. (Table B3 is of this type.) Quoting the actual P value allows the reader to make his or her own interpretation.

But how does one interpret P values if not in relation to the 0.05 level? There is no really satisfactory answer to this question, because P values are an unnatural way of expressing results. In section 8.8 I contrast hypothesis testing and estimation via confidence intervals, and explain why the latter are greatly preferred.

8.5.3 第一类和第二类错误 8.5.3 Type I and Type II errors

The use of a cut-off for P leads to treating the analysis as a process for making a decision. Within this framework it is customary (but unwise) to consider that a statistically significant effect is a real one, and conversely that a non-significant result indicates that there is no effect. Forcing a choice between significant and non-significant obscures the uncertainty present whenever we draw inferences from a sample. When we construct a confidence interval the uncertainty is shown explicitly, but with a hypothesis test it is implicit, and may easily be overlooked.

Two possible errors can be made when using P to make a decision. Firstly, we can obtain a significant result, and thus reject the null hypothesis, when the null hypothesis is in fact true. This is called a Type I error, and may be thought of as a 'false positive' result. Alternatively, we may obtain a non-significant result when the null hypothesis is not true, in which case we make a Type II error. This can be thought of as a 'false negative' finding.

The probabilities of Type I and Type II errors are sometimes called alpha ($\alpha$) and beta ($\beta$). For any hypothesis test the value of alpha is determined in advance, usually as 5%. The value of beta depends upon the size of effect that one is interested in, and also the sample size. More often we talk about the power of a study to detect an effect of a specified size, where the power is $1 - \beta$, or $100(1 - \beta)\%$ when expressed as a percentage. A wide confidence interval is an indication of low power.

也可以通过选择合适的样本量预先固定β值。换言之,可以计算出研究所需的样本量,以便有较高概率发现给定大小的真实效应。第15章展示了两组比较研究的样本量计算方法。对于更复杂的设计,建议咨询统计学家以确定样本量。
We can also fix beta in advance by choosing an appropriate sample size. In other words, we can calculate the necessary sample size for a study to have a high probability of finding a true effect of a given magnitude. Chapter 15 shows how to perform the calculations for studies comparing two groups. For more complicated designs it is advisable to get advice on sample size from a statistician.

8.5.4 过度依赖 P 值 8.5.4 Over-reliance on P values

The formulation of statistical analysis as a test with two possible outcomes - significant or not significant - has had harmful effects on the medical literature. There is increasing evidence of publication bias in favour of papers reporting significant findings. If several identical studies are performed their results will vary because of sampling variation. Those studies that show larger effects will be more likely to be statistically significant and thus more likely to be published. The same applies even when the null hypothesis is true, as we know that one study in 20 will give a result significant at the 5% level. The consequence is that published studies are a biased selection of all studies carried out (see section 15.5.2).

The achievement of statistical significance is often seen as success and a non-significant result as failure. This is exemplified by the use of the terms 'positive' and 'negative' to describe studies with significant or non-significant results, a usage that should be abandoned. The same attitude is also seen in the ugly phrase 'failed to reach statistical significance', which appears in many papers.

Freiman et al. (1978) looked at 71 published trials with 'negative' results, defined as having P values greater than 0.1, and constructed confidence intervals for each study. They found that for nearly half the trials the results were compatible with a substantial therapeutic improvement, of a size which we may reasonably take as clinically valuable for any trial. In other words, the confidence intervals were wide enough to include the possibility that one treatment was appreciably better than the other. In none of the original papers had the authors constructed a confidence interval. Other ways of looking at these trials are that they had low power and that the sample size was too small. Because the standard error is related to sample size, a small study may fail to detect (as significant) a difference that is real. These trials demonstrate the non-equivalence of statistical significance and clinical importance.

8.5.5 Misinterpretation of P values

A common misinterpretation of the P value is that it is the probability of the data having arisen by chance or, equivalently, that P is the probability that the observed effect is not a real one. The distinction between this incorrect definition and the true definition given earlier is the absence of the phrase 'when the null hypothesis is true'. The omission leads to the incorrect belief that it is possible to evaluate the probability of the observed effect being a real one. The observed effect in the sample is genuine, but we do not know what is true in the population. All we can do with this approach to statistical analysis is to calculate the probability of observing our data (or more unlikely data) when the null hypothesis is true.

8.5.6 Two-sided or one-sided P values?

To reiterate, the P value is the probability of obtaining a result at least as extreme as the observed result when the null hypothesis is true. I pointed out earlier that extreme results can occur by chance equally often in either direction, which we allow for by calculating a two-sided P value. In the vast majority of cases this is the correct procedure. In rare cases it is reasonable to consider that a real difference can occur in only one direction, so that an observed difference in the opposite direction must be due to chance. Here the alternative hypothesis is restricted to an effect in one direction only, and it is reasonable to calculate a one-sided P value by considering only one tail of the distribution of the test statistic. For a test statistic with a Normal distribution the usual two-sided 5% cut-off point is 1.96, whereas a one-sided 5% cut-off is given by 1.64. The difference is not particularly large but can lead to a different interpretation in relation to fixed levels of statistical significance.
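The two cut-off points quoted above can be checked with a short sketch. It assumes a test statistic with a standard Normal distribution and uses Python's standard library; the function names are illustrative only and do not come from the text.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normal_cutoffs(alpha=0.05):
    """Return (two-sided, one-sided) cut-off points for significance level alpha."""
    two_sided = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)  # alpha split between both tails
    one_sided = STANDARD_NORMAL.inv_cdf(1 - alpha)      # all of alpha in one tail
    return two_sided, one_sided


def normal_p_values(z_observed):
    """Return (two-sided, one-sided upper-tail) P values for an observed statistic."""
    upper_tail = 1 - STANDARD_NORMAL.cdf(z_observed)
    two_sided = 2 * (1 - STANDARD_NORMAL.cdf(abs(z_observed)))
    return two_sided, upper_tail


two_sided_cut, one_sided_cut = normal_cutoffs()
print(round(two_sided_cut, 2), round(one_sided_cut, 2))  # 1.96 1.64

# A statistic of 1.8 falls between the two cut-offs, so the two kinds of
# P value land on opposite sides of 0.05.
print(normal_p_values(1.8))
```

A value such as 1.8 is the kind of result for which the choice between one-sided and two-sided testing changes the verdict at a fixed 5% level, which is why that choice has to be made before the data are seen.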

One-sided tests are rarely appropriate. Even when we have strong prior expectations, for example that a new treatment cannot be worse than an old one, we cannot be sure that we are right. If we could be sure we would not need to do an experiment! If it is felt that a one-sided test really is appropriate, then this decision must be made before the data are analysed; it must not depend on what the results were. The small number of one-sided tests that I have seen reported in published papers have usually yielded P values between 0.025 and 0.05, so that the result would have been non-significant with a two-sided test. I doubt that most of these were pre-planned one-sided tests.

The estimation and hypothesis testing approaches will be compared in section 8.8. There is a close relation between the two, but only for a two-sided hypothesis test. Two-sided P values will be used throughout this book, and I recommend that they are used routinely. In some places I quote more exact P values than can be obtained from the tables in Appendix B. Many computer programs give exact P values.

8.6 NON-PARAMETRIC METHODS

Although confidence intervals and hypothesis testing are rather different approaches to statistical analysis, they have a close mathematical link for the majority of statistical methods, because they are both based on the same statistical model and the same assumptions about sampling distributions. Theoretical distributions are described by quantities called parameters, notably the mean and standard deviation, so methods that use distributional assumptions are called parametric methods. There is another class of statistical methods that do not involve distributional assumptions, which are called distribution-free or non-parametric methods. Because these methods are based on analysis of ranks rather than actual data, they are sometimes called rank methods. Unfortunately none of these three terms accurately describes all the methods usually considered to fall into this category. In this book I shall usually refer to these methods as non-parametric as this is the term in most frequent use. Note that 'non-parametric' applies to the statistical method used to analyse data, and is not a property of the data.

As they do not usually involve any distributional assumptions, non-parametric methods are most often used to analyse data which do not meet the distributional requirements of parametric methods - usually that the data have a Normal distribution. Skewed data are commonly analysed by non-parametric methods, and methods using ranks are especially suitable for data which are scores rather than measurements. These could have many possible values, such as data from visual analogue scales, or only a few values, such as Apgar scores or stage of disease.

Table 8.1 shows fasting blood glucose data from a study of Type 1 diabetics (Thuesen et al., 1985) together with the ranks of the observations. When there are two or more identical values the average rank is given to each of the 'tied' observations concerned, as is shown for the two values of 6.7 mmol/l. Instead of analysing the actual observations using parametric methods we could analyse the ranks using non-parametric methods. For example, we might wish to compare the blood glucose data for two subgroups of the diabetics, for which the analysis would be based on the sums of the ranks for all subjects within each subgroup. The appropriate methods are discussed in the next chapter.

Table 8.1 Fasting blood glucose levels in 24 Type 1 diabetics (Thuesen et al., 1985)

| Blood glucose (mmol/l) | Rank order |
|---|---|
| 4.2 | 1 |
| 4.9 | 2 |
| 5.2 | 3 |
| 5.3 | 4 |
| 6.7 | 5.5 |
| 6.7 | 5.5 |
| 7.2 | 7 |
| 7.5 | 8 |
| 8.1 | 9 |
| 8.6 | 10 |
| 8.8 | 11 |
| 9.3 | 12 |
| 9.5 | 13 |
| 10.3 | 14 |
| 10.8 | 15 |
| 11.1 | 16 |
| 12.2 | 17 |
| 12.5 | 18 |
| 13.3 | 19 |
| 15.1 | 20 |
| 15.3 | 21 |
| 16.1 | 22 |
| 19.0 | 23 |
| 19.5 | 24 |
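The ranking rule used in Table 8.1, including the average rank given to tied values, can be sketched as follows. The function name `average_ranks` is an illustrative choice, not something defined in the text; the data are the 24 blood glucose values from the table.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the block [start, end] over all observations tied with the first.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions are 1-based, so the block covers ranks start+1 .. end+1.
        shared_rank = ((start + 1) + (end + 1)) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


glucose = [4.2, 4.9, 5.2, 5.3, 6.7, 6.7, 7.2, 7.5, 8.1, 8.6, 8.8, 9.3,
           9.5, 10.3, 10.8, 11.1, 12.2, 12.5, 13.3, 15.1, 15.3, 16.1, 19.0, 19.5]

for value, rank in zip(glucose, average_ranks(glucose)):
    print(value, rank)
```

The two observations of 6.7 mmol/l would otherwise occupy ranks 5 and 6, so each receives 5.5. Averaging in this way leaves the total of the ranks unchanged, which matters because the analyses are based on rank sums.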

To compensate for the important advantage of being free of assumptions about the distribution of the data there is the disadvantage that rank methods tend to be more suited to hypothesis testing than estimation. Non-parametric estimates can be calculated, however, the best known example being the median, and it is also possible in some cases to calculate non-parametric confidence intervals. Estimation becomes difficult or impossible for more complex data structures and many problems cannot be handled at all using rank methods.

For simple problems, such as comparing one variable in two groups of subjects or relating two variables within one group, the distribution-free approach has definite advantages, and its use will be contrasted with the parametric approach in later chapters.

Non-parametric methods are mostly based on comparing sums of ranks. The sum of a set of observations is a simple multiple of their average, so the central limit theorem also applies to these rank sums. Thus unless the samples are small it is often possible to use a Normal approximation when carrying out a non-parametric test, making it easier to apply the method. It seems strange to use the Normal distribution in this way when the methods explicitly avoid having to make any assumptions about the specific nature of the distribution of the observations. It is important to distinguish the two uses of the Normal distribution in statistics: to describe the distribution of a set of observations and to describe (or approximate) the sampling distribution of some quantity of interest.
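A small sketch may make the Normal approximation to a rank sum concrete. It assumes two groups of six observations with no ties and a null hypothesis under which every allocation of ranks to the first group is equally likely. The formulas for the mean and variance of the rank sum are standard results that are not stated in this section, and the continuity correction of 0.5 is likewise an assumption of the sketch.

```python
from itertools import combinations
from statistics import NormalDist

n1, n2 = 6, 6                      # hypothetical group sizes
total = n1 + n2

# Every possible rank sum for the first group: one per choice of n1 ranks.
rank_sums = [sum(chosen) for chosen in combinations(range(1, total + 1), n1)]

exact_mean = sum(rank_sums) / len(rank_sums)
exact_variance = sum((w - exact_mean) ** 2 for w in rank_sums) / len(rank_sums)

# Standard results for a rank sum without ties (assumed, not from the text).
formula_mean = n1 * (total + 1) / 2
formula_variance = n1 * n2 * (total + 1) / 12


def exact_lower_tail(w):
    """Exact probability that the rank sum is w or smaller."""
    return sum(1 for s in rank_sums if s <= w) / len(rank_sums)


def normal_lower_tail(w):
    """Normal approximation to the same probability, with continuity correction."""
    approximation = NormalDist(formula_mean, formula_variance ** 0.5)
    return approximation.cdf(w + 0.5)


print(exact_mean, formula_mean)
print(exact_variance, formula_variance)
print(exact_lower_tail(28), normal_lower_tail(28))
```

Nothing here assumes that the observations themselves are Normal. Only the sampling distribution of the rank sum is being approximated, which is the second of the two uses of the Normal distribution distinguished above.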

8.7 STATISTICAL MODELLING

Behind the ideas of estimation and hypothesis testing lies a general strategy for statistical analysis called modelling. A statistical model is a mathematical relationship between two or more variables that gives an approximate description of the observed data. We do not usually believe that the model describes the underlying mechanism of a relation between variables, but it is a simplification which is compatible with the data.

Most of the parametric methods described in this book fall into a unified theoretical framework known as linear models, where 'linear' means 'additive'. The idea is that the observed data can be explained by a model in which the effects of different influences are added. To return to the example of blood pressure given at the start of Chapter 3, a statistical model for blood pressure might include contributions relating to age, sex, race, smoking, time of day, and so on.

For most analyses described in this book the underlying statistical model is very simple and will not usually be described, but I shall introduce models explicitly in Chapters 11 and 12. However, two key ideas associated with statistical models will be apparent throughout. First, certain assumptions are made when we fit a model, and it is important to try to verify that these are reasonable. An obvious common example is the assumption that the data have an approximately Normal distribution, some form of which appears in nearly all of the models described in this book. Second, it is also important to consider two aspects of how well the model 'fits' the data. We need to check that there are no systematic discrepancies, and we must also consider how useful the model is at predicting a value for an individual. For example, many researchers have fitted models to try to predict birthweight from maternal characteristics and fetal measurements. Although many variables are known to be related to birthweight, models that include all known influences do not allow us to predict birthweight at all accurately for an individual baby. In a sense to be defined in Chapter 11, the models account for only 25% to 30% of the variability in birthweight. Here we see again the distinction between estimation and hypothesis testing. The variables in the model are significantly associated with birthweight, both individually and collectively, but the estimates of birthweight derived from the model are too imprecise to be clinically useful (although they may be epidemiologically useful).

8.8 ESTIMATION OR HYPOTHESIS TESTING?

Over the last 40 years there has been a dramatic surge in the use of statistical methods in medical research, with widespread use of hypothesis tests and a trend towards more complex methods of analysis. Nowadays few research papers do not include hypothesis tests, but unfortunately their use is often at the expense of any other interpretation of the data. In particular it is common to see the results of some comparison expressed solely as a P value, or even just as 'significant' or 'not significant'. While P values are informative they tell only part of the story, and need to be accompanied by more direct information about what was actually observed.

Some research is purely exploratory, for example looking for possible associations worthy of more detailed study, but for most research the results cannot be meaningfully interpreted from a pronouncement of 'statistically significant'. As discussed above, it is not necessarily true that such a result is clinically significant, nor is a non-significant finding necessarily ignorable. Quantification of the results by simple estimates is an essential part of the analysis of data. Whether a clinician will use a new treatment that reduces blood pressure or the frequency of migraines will depend on the amount of the reduction. It may also depend on how consistent the effect is. A drug that reduces everybody's incidence of migraines by 30% may be better than one which reduces the incidence by 50% for some patients but does nothing for others. A single number (the P value) cannot convey all the necessary information; the appropriate estimates and confidence intervals are needed too.

Most published research does include estimates of the effects of interest, and it has become standard practice to include P values, but until recently the use of confidence intervals was rare. Lately, however, there has been a welcome move by several leading medical journals towards encouraging or even requiring authors to present confidence intervals in conjunction with their main findings (see Gardner and Altman, 1989a).

8.8.1 Relation between confidence intervals and statistical significance

Different though hypothesis testing and confidence intervals may appear, there is in fact a close relation between them. The P value will be less than 0.05 (i.e. 'significant') only when the 95% confidence interval does not include zero (or, more generally, the value specified in the null hypothesis). The reason for this relation is that both methods are based on similar aspects of the theoretical distribution of the test statistic. The same relation applies between the 99% confidence interval and the related significance test at the 1% level, and so on.
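The correspondence between the confidence interval and the P value can be illustrated with a minimal sketch. It assumes an estimate whose sampling distribution is Normal with a known standard error; the estimates and standard error used below are invented purely for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def interval_and_p(estimate, standard_error, level=0.95, null_value=0.0):
    """Return ((lower, upper), two-sided P) under a Normal sampling distribution."""
    # Multiplier for the chosen coverage, e.g. about 1.96 for a 95% interval.
    multiplier = STANDARD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    lower = estimate - multiplier * standard_error
    upper = estimate + multiplier * standard_error
    z = (estimate - null_value) / standard_error
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return (lower, upper), p_value


# Invented estimates, each with a standard error of 1.0.
for estimate in (1.0, 1.9, 2.0, 3.0):
    (lower, upper), p_value = interval_and_p(estimate, 1.0)
    excludes_zero = lower > 0 or upper < 0
    print(estimate, round(lower, 2), round(upper, 2), round(p_value, 4), excludes_zero)
```

In every row the 95% interval excludes zero exactly when P is below 0.05, because both are determined by whether the estimate lies more than about 1.96 standard errors from the null value. Passing `level=0.99` gives the same agreement with a 1% significance level.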

The confidence interval shows the uncertainty, or lack of precision, in the estimate of interest, and thus conveys more useful information than the P value. Because of the relation described above, by presenting a 95% confidence interval we also indicate whether P is above or below the cut-off level of 5%. The presentation of both the actual P value and the confidence interval is desirable, but if only one is given the P value may be omitted - it is less important, and in any case can be gauged roughly from the confidence interval.

The issues discussed in this section are considered at greater length by Cox (1982) and Gardner and Altman (1989b).

8.9 STRATEGY FOR ANALYSING DATA

I strongly recommend that a computer, or at least a programmable calculator, is used for statistical analysis. Chapter 6 presented various advantages, but also some drawbacks, of using a computer. Section 6.6 gave a strategy for analysing data using a computer, although the principles are not specific to analysis by computer.

One aspect not covered in Chapter 6 was how to tell which is the appropriate method of analysing a set of data. Chapters 9 to 12 describe a large number of different methods of analysis. The titles of these chapters are descriptive of the problems tackled rather than the names of the methods:

| Chapter | Title |
|---|---|
| 9 | Comparing groups - continuous data |
| 10 | Comparing groups - categorical data |
| 11 | Relation between two continuous variables |
| 12 | Relation between several variables |

Chapters 9 and 10 cover analyses where you have a single variable of interest for one, two or more groups. Within these chapters the distinction is made between observations made on different groups of individuals and observations made on more than one occasion on the same individuals - 'paired data'. Chapters 11 and 12, in contrast, cover analyses where we are interested in the inter-relationship between two or more variables for a single group of individuals. Note that in most studies information on a large number of variables is collected, but the variables are analysed separately using the simpler techniques of Chapters 9 and 10. Chapter 12 gives guidance on when this is or is not a sensible approach.

Chapter 13 considers the analysis of survival times, which is a special case of the problems considered in Chapter 9 and requires special methods of analysis, and more general problems in the analysis of time-related data. Chapter 14 discusses some specific common problems in the analysis of medical data. For many of the methods described in these chapters both confidence intervals and hypothesis tests are presented.

8.10 PRESENTATION OF RESULTS

The methods introduced in this chapter recur in several subsequent chapters so some general comments on presentation of results may be helpful.

Estimates and confidence intervals should be treated in the same way as means and standard deviations (see section 3.7). The percentage coverage of confidence intervals should be stated.

Where possible give actual P values rather than ranges such as P<0.05. No more than two significant figures need be quoted, as in P=0.14, P=0.012 or P=0.001. It is not usually necessary to specify P below 0.0001. If you obtain P from tables then you will end up with a value between two limits, according to the values that are tabulated. We use the signs '<' (less than) and '>' (greater than) in expressions such as P<0.05 or 0.05>P>0.01. It is conventional to use the shorter form P<0.05 when P is between 0.01 and 0.05, as it is assumed that if P was less than 0.01 you would have used P<0.01. For values of P greater than 0.05 it is useful to be more specific than P>0.05, for example by P=0.15 or . Do not use the abbreviation NS for not significant without defining the term (usually P>0.05) and please do not use the appalling . It is generally assumed that P values are two-sided unless stated otherwise. The use of one-sided tests should always be noted (and justified).
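As a sketch of these reporting conventions, an exact P value from a computer program could be formatted as below. The helper `format_p` is hypothetical, and treating 0.0001 as the floor below which only 'P<0.0001' is reported is one reading of the advice above rather than a fixed rule.

```python
def format_p(p, floor=0.0001):
    """Format a P value to at most two significant figures, e.g. 'P=0.012'."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p < floor:
        # Very small values are reported only as being below the floor.
        return f"P<{floor}"
    # The 'g' presentation type keeps two significant figures and drops trailing zeros.
    return f"P={p:.2g}"


for p in (0.1432, 0.0123, 0.001, 0.00003):
    print(format_p(p))
# P=0.14
# P=0.012
# P=0.001
# P<0.0001
```

The formatting deals only with presentation. Whether a value is one-sided or two-sided still has to be stated in the accompanying text.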

8.11 SUMMARY

Analysing your own data and being able to evaluate the medical literature depend upon understanding the basic ideas behind statistical analysis as well as being familiar with the statistical methods used.

In this chapter I have discussed in detail the idea of a sampling distribution relating to a parameter of interest, such as a mean or a proportion. A major topic covered was the central limit theorem, by which the sampling distribution of the mean of a sample approaches a Normal distribution as the sample size increases, regardless of the shape of the distribution of the data in the population. This result underlies many of the methods described in subsequent chapters.

I have also introduced the two main approaches to statistical inference - estimation and hypothesis testing. The general principles outlined are fundamental to an appreciation of the remaining chapters of this book, and to understanding what statistical analysis and interpretation is all about. Published papers tend to present results in a shorthand way that can be opaque - for example as means and standard errors. It is important to know what can and cannot be inferred from these quantities, especially by constructing confidence intervals. Likewise, most published papers contain P values but the interpretation of them is often faulty. It is important to understand the true meaning of the P value, and to realize that statistical significance and clinical importance are not the same thing.

It may be helpful to re-read parts of this chapter after the next few chapters describing particular statistical methods.

EXERCISES

8.1 There are two hospitals in a town. On average 45 babies are born each day in the larger hospital, and 15 in the smaller. The probability of a baby being a boy is about 0.52, and the probability of twins is about 0.012. On any day which hospital is more likely

(a) to have a set of twins delivered,

(b) to have more than of babies being boys?

(No mathematics is required to answer these questions.)

(Based on Kahneman and Tversky, 1982)

8.2 Eight diabetic patients had plasma glucose levels (mmol/l) measured before and one hour after oral administration of glucose (Feingold et al., 1989), with the following results

| Patient | Plasma glucose before (mmol/l) | Plasma glucose after (mmol/l) | Change (mmol/l) |
|---|---|---|---|
| 1 | 4.67 | 5.44 | 0.77 |
| 2 | 4.97 | 10.11 | 5.14 |
| 3 | 5.11 | 8.49 | 3.38 |
| 4 | 5.17 | 6.61 | 1.44 |
| 5 | 5.33 | 10.67 | 5.34 |
| 6 | 6.22 | 5.67 | -0.55 |
| 7 | 6.50 | 5.78 | -0.72 |
| 8 | 7.00 | 9.89 | 2.89 |

(a) Calculate the standard error of the mean change in plasma glucose.

(b) On the basis of these data, how many diabetic patients would need to be studied so that the width of the confidence interval for the mean change in plasma glucose level was ? (Assume that the Normal distribution is the appropriate sampling distribution for the change in plasma glucose.)

8.3 In a clinical trial in which a total of 100 patients are allocated to two treatments by simple randomization, show that the probability that the difference between the numbers of patients in the two treatment groups exceeds 20 is about . (Hint: consider the distribution of the number of patients allocated to one of the groups.)

8.4 A controlled trial was performed to compare the corticosteroid prednisolone and placebo in patients with chronic active hepatitis positive for hepatitis B surface antigen (Lam et al., 1981). In response to a letter criticizing the analysis the author wrote: 'The one-tailed test was used in the calculations, since in a previous analysis major complications were encountered significantly more frequently in the steroid-treated group' (Ng et al., 1981). (This information had not been given in the original paper.)

Is this a valid justification for performing one-tailed tests? If not, why not?

9 Comparing groups - continuous data

Good answers come from good questions not from esoteric analysis. Schoolman et al. (1968)

9.1 INTRODUCTION

We can now build on the ideas of the previous chapters to consider the main methods of data analysis. In particular we will use the ideas introduced in the previous chapter - estimation and hypothesis testing. Other important ideas are the relation between the analysis and the research design (Chapter 5) and the nature of the observations (Chapter 2).

This chapter deals with comparing groups of observations with respect to continuous data, starting with the simplest case where we wish to compare a single group of observations with some prespecified value, and moving through to the case where we have several sets of observations on each of a group of individuals. Both parametric and non-parametric approaches to analysis are introduced. Chapter 10 considers the same situations when the data are categorical.

9.2 CHOOSING AN APPROPRIATE METHOD OF ANALYSIS

When choosing an appropriate method of analysis there are several aspects of the data that we must consider, relating to the design of the study, the nature of the data, and the purpose of the analysis.

9.2.1 The number of groups of observations

Although methods of dealing with several groups of observations can be used for just one or two groups it is convenient to consider the one and two group cases separately, as the methods can be simplified, and there are fewer problems of interpretation. The two group case is the most common type of statistical analysis.

9.2.2 Independent or dependent groups of observations

When there are two or more sets of observations there are two types of design that must be distinguished:

1. The observations relate to independent groups of individuals. For example, we may have birthweights of boys and girls or groups of patients with different diseases. The sample size may vary from group to group.

2. Each set of observations is made on the same group of individuals. For example, we may have antenatal and postnatal blood pressure measurements from one group of women. We call such data paired to indicate that the observations are on the same individuals rather than from independent samples. Clearly we must have the same number of observations in each set of data.

Sometimes two different groups of subjects are studied where each person is individually matched with a member of the other group. Here the data are clearly linked and should be treated as if they are paired observations on one group.

9.2.3 The type of data

The distinction between continuous and categorical data was discussed in Chapter 2. There are several types of continuous data, however, and the nature of the observations has implications for statistical analysis. Specifically, parametric methods are based on calculating means and standard deviations, so they are inappropriate for ordered categorical data such as the Apgar score described in Chapter 2.

9.2.4 The distribution of data

For independent groups, parametric methods require the observations within each group to have an approximately Normal distribution, and the standard deviations in each group should be similar. If the raw data do not satisfy these conditions, a transformation may be successful (see Chapter 7). Otherwise a non-parametric method should be used.

For paired data relating to two or more observations on the same people there is no assumption that each set of observations should be Normally distributed, but there is a different assumption of Normality, discussed below.

9.2.5 The objective of the analysis

Both estimation and hypothesis testing are considered throughout this chapter. However, with three or more groups of data there are several possible comparisons between groups. The choice of which to investigate should follow directly from the objectives of the study.

9.3 THE t DISTRIBUTION

In the previous chapter I showed how to calculate confidence intervals and perform hypothesis tests based on the assumption that the estimates of interest, either means or proportions, had a Normal distribution. Because of the central limit theorem we know that this is a reasonable assumption for large samples, but not all samples are large (more than 100, say). In the analysis of continuous data the calculation of means plays a prominent part, and so we need to consider the distribution of the mean for smaller samples.

Early in this century it was shown by W. S. Gosset, writing under the name of 'Student', that the mean of a sample from a Normal distribution with unknown variance has a distribution that is similar to, but not quite the same as, a Normal distribution. He called it the t distribution, and we still refer to it as Student's t distribution. As the sample size increases the sampling distribution of the mean becomes closer to the Normal distribution. We use the t distribution for estimation and hypothesis testing relating to the means of one or two samples. Although we can use the Normal distribution for large samples there is little point in doing so, since for large samples the methods give virtually identical answers and it is simpler to use the same method regardless of the sample size.

$t$ 分布有一个参数,称为自由度。自由度的概念是统计学中较为抽象的思想之一。一般来说,自由度等于样本量减去估计参数的数量。$t$ 分布的自由度与估计标准差有关,标准差是围绕估计均值的变异度计算得出。因此,对于单个样本的 $n$ 个观测值,自由度为 $n - 1$。
The $t$ distribution has one parameter, a quantity called the degrees of freedom. The concept of degrees of freedom is one of the more elusive statistical ideas. In general the degrees of freedom are calculated as the sample size minus the number of estimated parameters. The degrees of freedom for the $t$ distribution relate to the estimated standard deviation, which is calculated as variation around the estimated mean. Hence for a single sample of $n$ observations we have $n - 1$ degrees of freedom.

图9.1显示了自由度为5和25的 $t$ 分布及标准正态分布。自由度为25时的 $t$ 分布已接近正态分布,且随着样本量增加,$t$ 分布愈发接近正态分布。差异主要体现在分布的尾部,而尾部通常是我们关注的重点。
Figure 9.1 shows the $t$ distributions with 5 and 25 degrees of freedom, together with the Normal distribution. The $t$ distribution with 25 degrees of freedom is close to the Normal distribution, and as the sample size increases the $t$ distribution becomes ever more Normal. The difference is most marked in the tails of the distributions, which is usually the part that we are interested in.

本章介绍的几乎所有参数方法,以及后续大多数方法,都使用 $t$ 分布。在第8章,我展示了如何通过将感兴趣的量除以其标准误,利用正态分布计算检验统计量。使用 $t$ 分布时计算方法相同,唯一不同的是查表时使用 $t$ 分布表(表B4)而非正态分布表。同样,我们用 $t$ 分布计算置信区间。
Nearly all the parametric methods introduced in this chapter, and most that follow, make use of the $t$ distribution. In Chapter 8 I showed how we calculate a test statistic using the Normal distribution by taking the ratio of the quantity of interest to its standard error. We use the same method of calculation when using the $t$ distribution. The only difference is that we look up the result in a table of the $t$ distribution (Table B4) rather than the Normal distribution. Likewise, we use the $t$ distribution to calculate confidence intervals.


图9.1 Student的 $t$ 分布,(a) 自由度为5,(b) 自由度为25,同时展示标准正态分布。
Figure 9.1 Student's $t$ distribution with (a) five, and (b) 25 degrees of freedom, together with the standard Normal distribution.

本章首先讨论三种使用 $t$ 分布的情形:单样本、配对样本和两独立样本。最后,对于超过两组样本的情况,我们需要使用称为方差分析的方法,采用 $F$ 分布(后文介绍)而非 $t$ 分布。所有这些参数方法都对正态性做出假设。第9.7节介绍了通过取对数处理偏态数据的分析方法。或者,本章讨论的所有问题均可采用非参数方法,相关内容在各节中均有介绍。
This chapter deals first with three situations where we use the $t$ distribution: for one sample, paired samples, and two independent samples. Lastly, for the case with more than two samples we need the method called analysis of variance, for which we use the $F$ distribution (introduced later) rather than the $t$ distribution. All these parametric methods make assumptions about Normality. Section 9.7 describes the analysis of skewed data by taking logarithms. Alternatively, non-parametric methods are available for all the problems discussed in this chapter, and are introduced within each section.

9.4 一组观测值 9.4 ONE GROUP OF OBSERVATIONS

最简单的情况是我们希望将一组观测值的均值与某个特定值进行比较。这样的比较并不常见,但这一简单案例的方法论为统计推断的主要方法提供了良好的入门。第9.4.1节和9.4.2节介绍参数方法,相应的非参数方法则在第9.4.3节到9.4.5节中说明。
The simplest case to consider is when we wish to compare the mean of a single group of observations with a specific value. Comparisons like this are not very common, but the methodology for this simple case gives a good introduction to the main methods of statistical inference. Sections 9.4.1 and 9.4.2 describe parametric methods, with the equivalent non-parametric methods described in sections 9.4.3 to 9.4.5.

举例来说,假设我们希望将某一组个体的平均膳食摄入量与推荐的每日摄入量进行比较。表9.1显示了11名年龄在22至30岁的健康女性在10天内的平均每日能量摄入量。她们的平均每日摄入量为6753.6 kJ。这个样本量虽小,但观测值无明显偏态,可合理视为近似正态分布。注意,每个观测值本身是多天数据的平均值。对于高度变异的量,取多个观测值有时是个好主意。我们能如何评价这些女性的能量摄入量与推荐的每日摄入量7725 kJ的关系?
As an example, suppose we wish to compare the mean dietary intake of a particular group of individuals with the recommended daily intake. Table 9.1 shows the average daily energy intake over ten days in 11 healthy women aged 22-30. Their mean daily intake was 6753.6 kJ. The small sample of observations shows no obvious skewness and may reasonably be taken as approximately Normal. Notice that each observation is itself an average value over several days. It is sometimes a good idea to take several values of a highly variable quantity. What can we say about the energy intake of these women in relation to a recommended daily intake of 7725 kJ?

表9.1 11名健康女性10天内的平均每日能量摄入量(单位:kJ)(Manocha等,1986)
Table 9.1 Average daily energy intake (kJ) over 10 days of 11 healthy women (Manocha et al.,1986)

| 受试者 Subject | 平均每日能量摄入 Average daily energy intake (kJ) |
|---|---|
| 1 | 5260 |
| 2 | 5470 |
| 3 | 5640 |
| 4 | 6180 |
| 5 | 6390 |
| 6 | 6515 |
| 7 | 6805 |
| 8 | 7515 |
| 9 | 7515 |
| 10 | 8230 |
| 11 | 8770 |
| 均值 Mean | 6753.6 |
| 标准差 SD | 1142.1 |

9.4.1 均值的置信区间 9.4.1 Confidence interval for the mean

这11名女性的平均每日能量摄入量低于推荐的7725 kJ,平均缺口为7725 - 6753.6 = 971.4 kJ。
On average the 11 women had a daily energy intake below the recommended level of 7725 kJ, the average deficit being $7725 - 6753.6 = 971.4$ kJ.

11个每日摄入量的标准差为1142.1 kJ,因此均值的标准误为 $1142.1/\sqrt{11} = 344.4$ kJ。我们使用 $t$ 分布计算均值的置信区间,遵循第8.4.5节中介绍的原则。对于95%的置信区间,我们需要对应双侧尾部面积为0.05的 $t$ 值,记为 $t_{0.975}$,自由度为 $11 - 1 = 10$。根据表B4,所需的 $t$ 值为2.228。均值摄入量的95%置信区间为
The standard deviation of the eleven daily intakes was 1142.1 kJ, so the standard error of the mean intake is $1142.1/\sqrt{11} = 344.4$ kJ. We use the $t$ distribution to calculate a confidence interval for the mean daily intake, following the principles outlined in section 8.4.5. For a 95% confidence interval we need the value of $t$ corresponding to a two-sided tail area of 0.05, denoted $t_{0.975}$, with $11 - 1 = 10$ degrees of freedom. From Table B4 the value of $t$ we need is 2.228. The 95% confidence interval for the mean intake is thus

$$6753.6 - 2.228 \times 344.4 \quad \text{to} \quad 6753.6 + 2.228 \times 344.4$$

即5986到7521 kJ。
or 5986 to 7521 kJ.
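The arithmetic for this interval can be sketched in a few lines of Python. The variable names are illustrative, and the critical value 2.228 is typed in from Table B4 rather than computed.

```python
import math

# Summary statistics from Table 9.1 (kJ)
n = 11
mean_intake = 6753.6
sd_intake = 1142.1

# Two-sided 5% point of t on n - 1 = 10 degrees of freedom (Table B4)
t_crit = 2.228

# Standard error of the mean, then the limits of the 95% interval
se = sd_intake / math.sqrt(n)
lower = mean_intake - t_crit * se
upper = mean_intake + t_crit * se

print(f"SE = {se:.1f} kJ")
print(f"95% CI = {lower:.0f} to {upper:.0f} kJ")
```

Subtracting both limits from 7725 gives the corresponding interval for the energy deficit.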

该区间不包括推荐的 7725 kJ 水平。如果我们假设这些女性是具有代表性的样本,那么可以推断该年龄段所有女性的平均每日能量摄入低于推荐值。然而,基于如此小的样本,尤其是在不了解样本如何选择的情况下,这样的解释是不明智的。
This range does not include the recommended level of 7725 kJ. If we assume that the women are a representative sample, then we can infer that for all women of this age average daily energy consumption is less than is recommended. The interpretation would, however, be unwise on the basis of such a small sample, especially without knowledge of how the sample was selected.

同样,我们可以计算能量缺口的置信区间。平均能量缺口为 971.4 kJ。平均缺口的标准误与平均摄入的标准误相同,因为从分布或观测值的每个数值中减去一个常数不会影响标准差。因此,能量缺口的 95% 置信区间是通过从平均每日摄入的置信区间中减去 7725 获得的。忽略负号,我们得到能量缺口的 95% 置信区间为 204 到 1739 kJ。
Similarly we can calculate a confidence interval for the energy deficit. The mean energy deficit was 971.4 kJ. The standard error of the mean deficit is the same as the standard error of the mean intake because subtracting a constant from each value of a distribution or set of observations does not affect the standard deviation. The 95% confidence interval for the energy deficit is thus obtained by subtracting 7725 from the confidence interval for the mean daily intake. Ignoring the negative sign, we get the 95% confidence interval for the energy deficit as 204 to 1739 kJ.

9.4.2 单样本 t 检验 9.4.2 One sample t test

我们还可以对零假设进行检验,即我们的数据来自一个具有特定“假设”均值的总体。该检验称为单样本 $t$ 检验,$t$ 值计算公式为
We can also carry out a test of the null hypothesis that our data are a sample from a population with a specific 'hypothesized' mean. The test is called the one sample $t$ test, and the value of $t$ is calculated as

$$t = \frac{\text{sample mean} - \text{hypothesized mean}}{\text{standard error of sample mean}}$$

遵循第8.5节描述的常见假设检验形式。如果假设总体均值为某个值 $k$,则公式可改写为
following the common form of hypothesis tests described in section 8.5. If the hypothetical population mean is some value $k$, we can rewrite the formula as

$$t = \frac{\bar{x} - k}{\operatorname{se}(\bar{x})}$$

或者
or

$$t = \frac{\bar{x} - k}{s/\sqrt{n}}$$

其中 $\bar{x}$ 和 $s$ 分别为样本大小为 $n$ 的样本均值和标准差。$t$ 的大小即为样本值与假设均值的平均偏差除以样本均值的标准误。
where $\bar{x}$ and $s$ are the mean and standard deviation of the sample of size $n$. The magnitude of $t$ is thus the average discrepancy of the sample values from the hypothetical mean, divided by the standard error of the sample mean.

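饮食摄入数据的均值和标准差分别为6753.6和1142.1 kJ,感兴趣的假设值是推荐摄入量 7725 kJ。因此我们可以计算 $t$ 值为
The mean and standard deviation of the dietary intake data were 6753.6 and 1142.1 kJ, and the hypothetical value of interest was the recommended intake of 7725 kJ. We can thus calculate the value of $t$ as

$$t = \frac{6753.6 - 7725}{1142.1/\sqrt{11}} = \frac{-971.4}{344.4} = -2.821$$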

我们使用表B4查找与观察到的 $t$ 值相关的 $P$ 值。对于双侧检验,可以忽略 $t$ 的符号,查找自由度为10时小于我们观察值的最大表中 $t$ 值。根据表B4,$0.01 < P < 0.02$,因此这些女性的饮食摄入显著低于推荐水平,采用常用的 $P < 0.05$ 标准。注意,统计显著性并不提供能量赤字大小或该估计不确定性的信息。
We use Table B4 to find the $P$ value associated with an observed value of $t$. We can ignore the sign of $t$ for a two-sided test, and look for the largest tabulated value of $t$ below our observed value, using 10 degrees of freedom. From Table B4 we get $0.01 < P < 0.02$, so that the dietary intake of these women was significantly less than the recommended level using the usual criterion of $P < 0.05$. Notice that statistical significance gives no information about the magnitude of the energy deficit, nor the uncertainty of that estimate.
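As a rough sketch of the calculation, the following Python computes $t$ from the summary statistics and brackets $P$ using a few two-sided critical values for 10 degrees of freedom. The function name `one_sample_t` is hypothetical, and the critical values are the standard tabulated ones typed in by hand, standing in for a lookup in Table B4.

```python
import math

def one_sample_t(mean, sd, n, hypothesized):
    """Return the one sample t statistic and its degrees of freedom."""
    se = sd / math.sqrt(n)
    return (mean - hypothesized) / se, n - 1

t_value, df = one_sample_t(6753.6, 1142.1, 11, 7725)

# Two-sided critical values of t on 10 degrees of freedom,
# keyed by the corresponding two-sided P value
critical = {0.05: 2.228, 0.02: 2.764, 0.01: 3.169}

# P is below every level whose critical value |t| exceeds
exceeded = [p for p, t_c in critical.items() if abs(t_value) > t_c]
bound = min(exceeded) if exceeded else None

print(f"t = {t_value:.3f} on {df} degrees of freedom")
print(f"P < {bound}" if bound is not None else "P > 0.05")
```

Here $|t|$ exceeds 2.764 but not 3.169, which is how the bracket $0.01 < P < 0.02$ arises.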

注意,我们用 $t$ 表示检验统计量的观察值,也用它表示理论分布中的特定值。为清晰起见,我在后者情况下总是使用下标。对于许多其他统计方法,我们对这两种用途使用稍有不同的符号。
Note that we use $t$ to indicate the observed value of the test statistic and also a particular value from the theoretical distribution. For clarity I always use a subscript in the latter case. For many other statistical methods we use slightly different notation for these two purposes.

9.4.3 中位数的置信区间 9.4.3 Confidence interval for the median

使用 $t$ 分布计算置信区间或进行 $t$ 检验的方法要求数据近似正态分布。如果数据偏斜或呈其他非正态分布,我们可以基于中位数而非均值进行推断。11名女性的中位能量摄入是第6高的摄入量,表9.1显示为 6515 kJ。我们可以在不做任何关于数据分布假设的情况下计算样本中位数的置信区间。数据按升序排列,置信区间的值的秩次从表B11中查得。根据该表,中位数的 95% 置信区间由秩次为2和10的数据值确定,即从5470到 8230 kJ。
The methods using the $t$ distribution to calculate a confidence interval or perform a $t$ test require the data to be approximately Normally distributed. If the data are skewed or have some other non-Normal distribution we can base our inference on the median rather than the mean. The median energy intake in the 11 women was the 6th highest intake, which Table 9.1 shows was 6515 kJ. We can calculate a confidence interval for a sample median without making any assumptions about the distribution of the data. The data are ranked in ascending order, and the ranks of the values defining the confidence interval are found from a table such as that given in Table B11. From that table the 95% confidence interval for the median is given by the data values with ranks 2 and 10; that is, from 5470 to 8230 kJ.
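The ranks 2 and 10 can be reproduced without the table, on the assumption that Table B11 uses the usual construction from the Binomial distribution with $p = 0.5$ (the table itself is not reproduced here). The sketch below finds the largest rank $r$ for which the interval from the $r$th smallest to the $r$th largest value still has at least 95% coverage; the function name `median_ci_ranks` is hypothetical.

```python
from math import comb
import statistics

intakes = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]

def median_ci_ranks(n, conf=0.95):
    """Ranks (lower, upper) defining a distribution-free CI for the median.

    Coverage of [x_(r), x_(n+1-r)] is 1 - 2 * P(X <= r - 1), X ~ Bin(n, 0.5).
    """
    tail = 0.0  # running value of P(X <= r - 1)
    r = 0
    while True:
        next_tail = tail + comb(n, r) / 2**n  # P(X <= r)
        if next_tail > (1 - conf) / 2:
            break
        tail = next_tail
        r += 1
    if r == 0:
        raise ValueError("sample too small for the requested confidence level")
    return r, n + 1 - r, 1 - 2 * tail

lo_rank, hi_rank, coverage = median_ci_ranks(len(intakes))
ordered = sorted(intakes)
median = statistics.median(ordered)
ci = (ordered[lo_rank - 1], ordered[hi_rank - 1])

print(f"median = {median} kJ, ranks {lo_rank} and {hi_rank}, CI = {ci}")
print(f"actual coverage = {coverage:.3f}")
```

Because the ranks are whole numbers, the actual coverage is usually somewhat above the nominal 95%, which is one reason the interval is wide in small samples.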

对于小样本,中位数的置信区间较宽,这里几乎是之前均值置信区间宽度的两倍。对于具有正态分布的大样本,均值和中位数及其置信区间将非常相似(尽管中位数的置信区间趋于更宽)。如果数据不接近正态分布,优先使用中位数。
For small samples the confidence interval for the median is rather wide, here being nearly twice as wide as the confidence interval for the mean given earlier. For larger samples of data that have a Normal distribution the mean and median will be very similar as will their confidence intervals (although that for the median will tend to be wider). It is preferable to use the median if the data are not near to Normal.

我将描述两种对单个样本进行非参数假设检验的方法:符号检验和Wilcoxon符号秩和检验。
I shall describe two methods for carrying out a non-parametric hypothesis test for a single sample, the sign test and the Wilcoxon signed rank sum test.

9.4.4 符号检验 9.4.4 Sign test

如果样本值与假设的特定值平均无差异,我们期望观察到的值有相同数量落在该特定值的上方和下方。因此,通过计算观察到的高于和低于该特定值的频数的概率,我们可以评估在原假设成立时观察到数据的可能性。这与例如计算样本中属于B血型人数的概率问题完全相同。我们因此使用二项分布,或其正态近似,来评估观察频数的概率,假设超过预期摄入量的真实概率 $p$ 为 0.5。
If there were no difference on average between the sample values and the hypothesized specific value we would expect an equal number of observations above and below the specific value. We can thus see how likely it would be to have observed our data when the null hypothesis is true by calculating the probability of our observed frequencies above and below the specific value. This is precisely the same type of problem as, for example, calculating the probability of observing given numbers of people in a sample who are in blood group B. We thus use the Binomial distribution, or the Normal approximation to it, to evaluate the probability of the observed frequencies when the true probability of exceeding the expected intake, $p$, is 0.5.

在我们的例子中,假设感兴趣的摄入量为 7725 kJ。两名女性的日摄入量高于7725,九名低于7725。我们使用第8.5节中给出的检验统计量的一般公式:
In our example the hypothesized intake of interest was 7725 kJ. Two women had daily intakes above 7725 and nine were below. We use the general formula for a test statistic given in section 8.5:

$$\text{test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error of observed value}}$$

这里我们关注的是二项分布,参数为 $n = 11$ 和 $p = 0.5$。我们观察到的计数为 $r = 2$ 或 $r = 9$;由于当 $p = 0.5$ 时分布的对称性,使用哪一个无关紧要。假设原假设成立,期望计数为 $np = 5.5$。根据第4.9节,$r$ 的标准误为
Here we are interested in the Binomial distribution with $n = 11$ and $p = 0.5$. Our observed count is either $r = 2$ or $r = 9$; it does not matter which we use because of the symmetry of the distribution when $p = 0.5$. The expected count, assuming the null hypothesis is true, is $np = 5.5$. From section 4.9, the standard error of $r$ is

$$\sqrt{np(1 - p)} = \sqrt{11 \times 0.5 \times 0.5} = 1.658$$

我们可以使用精确的二项分布,但当 $p = 0.5$ 时,即使样本量较小,二项分布的正态近似也是合理且更简便的。我们计算检验统计量 $z$ 如下:
We could use the exact Binomial distribution, but the Normal approximation to the Binomial is reasonable when $p = 0.5$ even for small samples, and is simpler to use. We calculate the test statistic, $z$, as

$$z = \frac{r - np}{\sqrt{np(1 - p)}} = \frac{2 - 5.5}{1.658} = -2.11$$

从表B2中,正态分布对应于 $z = -2.11$ 的双尾概率为 $P = 0.035$。如果我们使用 $r = 9$,则得到 $z = 2.11$,这将给出相同的双尾 $P$ 值。因此,观察数据与推荐值之间的差异在大约 3.5% 水平上具有统计学显著性,我们推断这些女性的平均每日摄入量确实低于推荐水平。
From Table B2 the two-sided tail area of the Normal distribution corresponding to $z = -2.11$ is $P = 0.035$. If we had used $r = 9$, we would have arrived at $z = 2.11$, which would give the same two-sided $P$ value. Thus the difference between the observed data and the recommended value is statistically significant at about the 3.5% level, and we infer that the average daily intake of these women really is lower than the recommended level.

关于符号检验,还需要进一步说明两点。首先,最好在检验中加入连续性校正。当连续分布被用来近似非连续数据时(如本例所示),我们通常会使用连续性校正。该调整是将观察计数 $r$ 与假设值 $np$ 之间的差值减少 $\tfrac{1}{2}$。我们写作 $|r - np| - \tfrac{1}{2}$,其中竖线表示取 $r - np$ 的绝对值;也就是说,如果 $r - np$ 为负,则忽略其符号。用连续性校正重新计算检验统计量得到:
Two further comments are needed in relation to the sign test. Firstly, it is preferable to incorporate a continuity correction into the test. We use the continuity correction in several circumstances when a continuous distribution is used as an approximation to non-continuous data, as is the case here. The adjustment involves reducing the difference between the observed count $r$ and the hypothesized value $np$ by $\tfrac{1}{2}$. We write this as $|r - np| - \tfrac{1}{2}$, where the vertical bars indicate that we take the absolute value of $r - np$; that is, we ignore the sign if $r - np$ is negative. Recalculating our test statistic with the continuity correction gives

$$z = \frac{|r - np| - \tfrac{1}{2}}{\sqrt{np(1 - p)}} = \frac{|2 - 5.5| - 0.5}{1.658} = 1.81$$

连续性校正的使用不可避免地会降低 $z$ 值并增加 $P$ 值,但如果不使用校正,计算结果在拒绝原假设时会显得稍微“乐观”一些。由于样本较小,校正后的 $z$ 值明显较低,但在大样本中这种影响很小。根据表 B2,$z$ 值为 1.81 对应的双侧 $P$ 值为 0.07,因此这种更为准确的检验方法得出的结果在 5% 显著性水平下并不完全显著。连续性校正应始终用于小样本,并且可以常规地纳入分析中。
Inevitably the use of the continuity correction will reduce $z$ and increase $P$, but without the correction the calculations are a little too 'optimistic' in favour of rejecting the null hypothesis. Because we have a small sample the corrected value of $z$ is quite a lot smaller, but in large samples the effect is minimal. From Table B2 a $z$ value of 1.81 corresponds to a two-sided $P$ value of 0.07, so that this more correct version of the test gives a result that is not quite significant at the 5% level. The continuity correction should always be used for small samples and can be incorporated routinely.
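The effect of the correction is easy to see by computing both versions side by side. This is a minimal sketch using only the Python standard library; the function name `sign_test` is hypothetical, and the two-sided Normal tail area is obtained from `math.erfc` in place of Table B2.

```python
import math

intakes = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]

def sign_test(values, hypothesized, continuity=True):
    """Sign test via the Normal approximation to the Binomial with p = 0.5."""
    above = sum(1 for v in values if v > hypothesized)
    below = sum(1 for v in values if v < hypothesized)
    n = above + below                     # values equal to the target are dropped
    expected = n * 0.5
    se = math.sqrt(n * 0.5 * 0.5)
    diff = abs(above - expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)       # shrink the difference by one half
    z = diff / se
    p = math.erfc(z / math.sqrt(2))       # two-sided tail area, 2 * (1 - Phi(z))
    return z, p

z_plain, p_plain = sign_test(intakes, 7725, continuity=False)
z_corr, p_corr = sign_test(intakes, 7725, continuity=True)

print(f"uncorrected: z = {z_plain:.2f}, P = {p_plain:.3f}")
print(f"corrected:   z = {z_corr:.2f}, P = {p_corr:.2f}")
```

The function counts observations above and below the hypothesized value itself, so any observation exactly equal to that value is left out of $n$ automatically.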

其次,如果有任何观察值恰好等于假设值,则在计算中忽略该观察值。因此,符号检验的样本量是与假设值不同的观察值数量。
Secondly, if any of the observations is exactly the same as the hypothesized value then we ignore that observation in the calculation. Thus the sample size for the sign test is the number of observations that differ from the hypothesized value.

符号检验是最基本的假设检验之一,并且以不同形式出现,作为解决其他问题的方法,最著名的是用于比较配对比例的McNemar检验(见第10.7.5节)。
The sign test is one of the most basic of hypothesis tests, and occurs in different guises as the solution to other problems, most notably as the McNemar test for comparing paired proportions (section 10.7.5).

9.4.5 威尔科克森符号秩和检验 9.4.5 The Wilcoxon signed rank sum test

符号检验仅考虑每个观察值是否高于或低于所选的感兴趣值。更理想的是考虑观察值的大小,我们可以通过使用威尔科克森符号秩和检验来实现。该方法包括三个步骤:
The sign test considers only whether each observation is above or below the chosen value of interest. It is preferable to take some account of the magnitude of the observations and we can do this by using the Wilcoxon signed rank sum test. The method has three steps:

1. 计算每个观察值与感兴趣值的差异;
   calculate the difference between each observation and the value of interest;
2. 忽略差异的符号,将其按大小排序;
   ignoring the signs of the differences, rank them in order of magnitude;
3. 计算所有负差异(或正差异)的秩和,对应于低于(或高于)所选假设值的观察值。
   calculate the sum of the ranks of all the negative (or positive) differences, corresponding to the observations below (or above) the chosen hypothetical value.

虽然该方法对观察值的分布形式没有特定假设,但假设它们来自对称分布的人群。对于单样本检验,这不是一个重要的考虑(但参见第9.7.2节)。
Although this method makes no assumptions about the particular form of the distribution of the observations, it does assume that they come from a population with a symmetric distribution. This is not an important consideration for a single sample test (but see section 9.7.2).

对于小样本(最多25个),$P$ 值可以从表B9获得。对于较大样本,检验统计量近似服从正态分布,均值为 $n(n+1)/4$,方差为 $n(n+1)(2n+1)/24$。与符号检验一样,零差异在计算中被忽略,因此公式中的 $n$ 是非零差异的数量,可能小于样本量。
For small samples (up to 25) $P$ values can be obtained from Table B9. For larger samples the test statistic has an approximately Normal distribution, with mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$. As with the sign test, zero differences are omitted from the calculations, so $n$ in this formula is the number of non-zero differences, and so may be less than the sample size.

表9.2显示了表9.1中11名女性的膳食摄入量及其与推荐摄入量的差异。同时显示了差异的秩,忽略了符号。两个高于推荐摄入量 7725 kJ 的观察值的秩和为 $3 + 5 = 8$,根据表B9,$P < 0.05$。我们也可以使用低于推荐摄入量的摄入值的秩和,即 58,根据表B9同样得出 $P < 0.05$。检查秩是否正确分配总是值得的,这很简单,因为所有秩次之和为 $n(n+1)/2$。这里为 $11 \times 12/2 = 66$,同时 $8 + 58 = 66$。
Table 9.2 shows the dietary intakes of 11 women from Table 9.1 together with the differences from the recommended intake. Also shown are the ranks of the differences, ignoring their signs. The sum of the ranks of the two observed intakes above the recommended 7725 kJ is $3 + 5 = 8$, so from Table B9 we get $P < 0.05$. We could equally well have used the sum of the ranks of the intakes below the recommended intake, which is 58, which from Table B9 also gives $P < 0.05$. It is always worth checking that the ranks have been calculated correctly, which is easy because the sum of all the ranks is $n(n+1)/2$. Here we have $11 \times 12/2 = 66$ and also $8 + 58 = 66$.

表9.2 11名健康女性每日能量摄入及其与推荐摄入量 7725 kJ 差值的秩次(忽略符号)
Table 9.2 Daily energy intake of 11 healthy women with rank order of differences (ignoring their signs) from the recommended intake of 7725 kJ

| 受试者 Subject | 每日能量摄入 Daily energy intake (kJ) | 与7725 kJ的差值 Difference from 7725 kJ | 差值秩次 Ranks of differences |
|---|---|---|---|
| 1 | 5260 | 2465 | 11 |
| 2 | 5470 | 2255 | 10 |
| 3 | 5640 | 2085 | 9 |
| 4 | 6180 | 1545 | 8 |
| 5 | 6390 | 1335 | 7 |
| 6 | 6515 | 1210 | 6 |
| 7 | 6805 | 920 | 4 |
| 8 | 7515 | 210 | 1.5 |
| 9 | 7515 | 210 | 1.5 |
| 10 | 8230 | -505 | 3 |
| 11 | 8770 | -1045 | 5 |
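The three steps of the method, including the average ranks given to tied differences and the check on the total, can be sketched as follows. The helper `average_ranks` is a hypothetical name, and differences are taken as 7725 minus the intake to match the sign convention of Table 9.2.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

intakes = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
recommended = 7725

# Step 1: differences, dropping any that are exactly zero
diffs = [recommended - x for x in intakes if x != recommended]
# Step 2: rank the differences ignoring their signs
ranks = average_ranks([abs(d) for d in diffs])
# Step 3: rank sums for intakes above (negative) and below (positive) 7725
sum_above = sum(r for d, r in zip(diffs, ranks) if d < 0)
sum_below = sum(r for d, r in zip(diffs, ranks) if d > 0)

n = len(diffs)
print(ranks)
print(sum_above, sum_below, n * (n + 1) / 2)
```

The last line prints the two rank sums next to $n(n+1)/2$, which is the check described in the text.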

一个重要的一般性观点是,我们不期望不同的检验方法在相同数据上给出完全相同的结果。它们的假设不同,且利用了观测数据的不同方面。一般而言,两种有效方法会得出相似的结论。然而,在小样本情况下,非参数方法的检验力较弱,因此如上例所示,非参数方法往往给出较不显著(较大)的P值,较参数方法更保守。
An important general point is that we do not expect different tests to give the same answer when applied to the same data. They do not make the same assumptions and use different aspects of the observations. In general, however, two valid methods will lead to similar answers. In small samples, however, non-parametric methods are rather lacking in power and so, as in the above example, will tend to give a less significant (larger) $P$ value than the equivalent parametric test.

实际中,我们通常只对一组数据进行一次分析,选择参数方法或非参数方法。除非有明确迹象表明参数方法不适用(即基本假设不满足),否则我们通常采用参数方法。
In practice we usually perform only one analysis of a set of data, choosing between parametric or non-parametric alternatives. We usually use a parametric method unless there is some clear indication that it is not valid, that is if the underlying assumptions are not met.

9.5 两组配对观测数据 9.5 TWO GROUPS OF PAIRED OBSERVATIONS

当我们有多组观测数据时,区分配对数据和独立组数据至关重要。配对数据出现于同一受试者在不同条件下多次测量的情况。此外,当两个不同组的受试者经过个体匹配(例如匹配对病例对照研究)时,也应将数据视为配对。
When we have more than one group of observations it is vital to distinguish the case where the data are paired from that where the groups are independent. Paired data arise when the same individuals are studied more than once, usually in different circumstances. Also, when we have two different groups of subjects who have been individually matched, for example in a matched pair case-control study, then we should treat the data as paired.

前一节分析的膳食摄入数据来源于一项研究,11名女性连续60天记录膳食摄入。她们并不知道研究目的是比较月经周期前后期的摄入量。表9.1中分析的是月经前的膳食摄入。表9.3显示了同一女性一个周期内月经前后期的膳食摄入平均值,结果显示每位女性月经后期的平均每日摄入均低于月经前期。
The dietary intake data analysed in the previous section come from a study in which the 11 women recorded their dietary intake for 60 consecutive days. They were unaware that the purpose of the study was to compare intake on the pre- and post-menstrual days of the menstrual cycle. The data in Table 9.1 already analysed were pre-menstrual dietary intakes. Table 9.3 shows both the pre-menstrual and post-menstrual dietary intakes for one cycle for the same women, from which we see that each woman's post-menstrual average daily intake was lower than her pre-menstrual intake.

对配对数据,我们关注每个个体观测值的平均差异及其差异的变异性。重点是个体内差异的变异,而非个体间的变异。通常我们不关心个体间的变异,且这类变异可能掩盖我们关注的效应。配对设计的优势在于通过仅关注个体内差异,消除个体间变异,这构成了后续分析方法的基础。通过分析差异,问题简化为单样本问题,因此可以使用与前节类似的方法。因为我们将个体内差异视为单一样本,所以我们要求近似服从正态分布的正是这些差异,而不要求每组数据都必须服从正态分布。
With paired data we are interested in the average difference between the observations for each individual and the variability of these differences. We are thus interested in the variability of the within-subject differences rather than between-subject variation. In general we are not particularly interested in variation between subjects, and indeed such variability may obscure the effects that we are interested in. The strength of the paired design is that we can remove between-subject variability by looking only at within-subject differences, and these thus form the basis for the method of analysis to be described. By looking at differences we effectively reduce the analysis to a one sample problem, so that we can use very similar methods to those discussed in the previous section. Because we treat the within-subject differences as a single sample, it is these differences which we require to have an approximately Normal distribution. There is no requirement for each set of data to be Normally distributed.

表9.3 经前10天与经后10天的平均每日膳食摄入量(Manocha等,1986)
Table 9.3 Mean daily dietary intake over 10 pre-menstrual and 10 post-menstrual days (Manocha et al., 1986)

| 受试者 Subject | 经前期 Pre-menstrual (kJ) | 经后期 Post-menstrual (kJ) | 差值 Difference (kJ) |
|---|---|---|---|
| 1 | 5260 | 3910 | 1350 |
| 2 | 5470 | 4220 | 1250 |
| 3 | 5640 | 3885 | 1755 |
| 4 | 6180 | 5160 | 1020 |
| 5 | 6390 | 5645 | 745 |
| 6 | 6515 | 4680 | 1835 |
| 7 | 6805 | 5265 | 1540 |
| 8 | 7515 | 5975 | 1540 |
| 9 | 7515 | 6790 | 725 |
| 10 | 8230 | 6900 | 1330 |
| 11 | 8770 | 7335 | 1435 |
| 均值 Mean | 6753.6 | 5433.2 | 1320.5 |
| 标准差 SD | 1142.1 | 1216.8 | 366.7 |

9.5.1 均值差的置信区间 9.5.1 Confidence interval for the difference between means

表9.3显示了每位女性经前期与经后期膳食摄入量的差值,以及这些差值的均值和标准差。我们可以将这些差值视为一个单一样本数据,使用第9.4节介绍的方法进行估计和假设检验。
Table 9.3 shows the difference in dietary intake between the pre- and post-menstrual days for each woman, and the mean and standard deviation of the differences. We can treat the differences as if they were a single sample of observations and use the methods introduced in section 9.4 for estimation and hypothesis testing.

因此,我们使用自由度为10、双侧尾部概率为0.05对应的 $t$ 值,即 $t_{0.975} = 2.228$。经前期与经后期差值的标准差为366.7,均值差的标准误为 $366.7/\sqrt{11} = 110.6$ kJ。因此,均值差的95%置信区间为
Thus, we use the same $t$ value corresponding to a two-sided tail area of 0.05 with 10 degrees of freedom, which is $t_{0.975} = 2.228$. The standard deviation of the differences between the pre- and post-menstrual days is 366.7, so the standard error of the mean difference is $366.7/\sqrt{11} = 110.6$ kJ. The 95% confidence interval for the mean difference is thus

$$1320.5 - 2.228 \times \frac{366.7}{\sqrt{11}} \quad \text{to} \quad 1320.5 + 2.228 \times \frac{366.7}{\sqrt{11}}$$

即1074.2到1566.8 kJ。整个置信区间远大于零,表明我们可以相当确定,经后期的膳食摄入量普遍明显降低。注意,该置信区间明显比经前期膳食摄入均值的置信区间(5986到7521 kJ)窄得多,因为我们消除了个体间的变异。
or 1074.2 to 1566.8 kJ. The whole confidence interval is much greater than zero, indicating that we can be reasonably sure that, in general, dietary intake is much lower in the post-menstrual period. Note that this confidence interval is considerably narrower than that for the mean pre-menstrual intake (5986 to 7521 kJ) because we have removed between-subject variability.

9.5.2 配对t检验 9.5.2 Paired t test

我们可以使用单样本 $t$ 检验计算均值比较的 $P$ 值。这里我们希望比较观察到的均值差,即1320.5 kJ,与假设值零,即原假设为经前期与经后期膳食摄入量相同。$t$ 值计算如下:
We can use the one sample $t$ test to calculate a $P$ value for the comparison of means. Here we wish to compare the observed mean difference of 1320.5 kJ with a hypothetical value of zero, i.e. the null hypothesis is that pre- and post-menstrual dietary intake is the same. The $t$ value is then given by

$$t = \frac{1320.5 - 0}{110.6} = 11.94$$

自由度为10。根据表B4,我们可以看到11.94远大于 $t$ 分布表中的临界值,因此 $P$ 远小于0.001。通常写作 $P < 0.001$ 即可。(实际的 $P$ 值是0.0000003。)
on 10 degrees of freedom. From Table B4 we can see that 11.94 is much larger than the tabulated critical values of the $t$ distribution, so that $P$ is considerably less than 0.001. It will usually suffice to write $P < 0.001$. (The actual $P$ value is in fact 0.0000003.)
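The whole paired analysis, from raw data to confidence interval and $t$ statistic, can be sketched with the standard library. Variable names are illustrative and the critical value 2.228 is again typed in from Table B4. Because this works from the unrounded differences, the limits can differ from the hand calculation in the first decimal place.

```python
import math
import statistics

# Table 9.3: pre- and post-menstrual intake (kJ) for the same 11 women
pre = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
post = [3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335]

# Reduce the problem to one sample of within-subject differences
diffs = [a - b for a, b in zip(pre, post)]
n = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)          # sample SD, divisor n - 1
se_diff = sd_diff / math.sqrt(n)

t_crit = 2.228                             # t on 10 df, two-sided 5% (Table B4)
lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff

# Paired t statistic: mean difference compared with zero
t_value = mean_diff / se_diff

print(f"mean difference = {mean_diff:.1f}, SD = {sd_diff:.1f}, SE = {se_diff:.1f}")
print(f"95% CI = {lower:.0f} to {upper:.0f} kJ")
print(f"t = {t_value:.2f} on {n - 1} degrees of freedom")
```

Replacing `diffs` with the pre-menstrual intakes alone gives the much larger standard error of section 9.4.1, which shows how much between-subject variability the pairing removes.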

9.5.3 非参数方法 9.5.3 Non-parametric methods

我们也可以对配对观察值的差异应用单样本符号检验。对于表9.3中的数据,所有11个差异符号相同,因此带连续性校正的检验统计量为
We can also apply the one sample sign test to the differences between paired observations. For the data in Table 9.3 all 11 differences have the same sign, so the test statistic, with the continuity correction, is

$$z = \frac{|11 - 5.5| - 0.5}{1.658} = 3.02$$

根据表B2,这对应于 $P = 0.003$。
which, from Table B2, corresponds to $P = 0.003$.

我们还可以对配对数据应用Wilcoxon检验,同样是直接对每个个体的差异进行处理。这种形式的检验称为Wilcoxon配对符号秩和检验。这里不使用相同的膳食数据来说明该检验(结果已十分明确),而是在第9.7.2节中介绍一些新数据,以展示Wilcoxon检验的一个缺点。
We can also apply a Wilcoxon test to paired data, again by working directly on the differences for each individual. In this form the test is called the Wilcoxon matched pairs signed rank sum test. Rather than illustrate the test on the same dietary data, for which the result is clear cut, I shall look at the method on some new data in section 9.7.2, where a drawback of the Wilcoxon test is illustrated.

9.6 两个独立观察组 9.6 TWO INDEPENDENT GROUPS OF OBSERVATIONS

最常用的统计分析可能是比较两个独立观察组。大多数临床试验产生这类数据,观察性研究中比较不同受试者组的数据也属于此类。对于连续数据,我们可以使用参数或非参数方法,下面将依次介绍。
The most common statistical analyses are probably those used for comparing two independent groups of observations. Most clinical trials yield data of this type, as do observational studies comparing different groups of subjects. For continuous data we can again use either parametric or non-parametric methods, and these will be described in turn.

对于配对数据,我们将配对观察值的差异视为一个单一样本。均差的标准误差用于置信区间和配对 $t$ 检验,基于每个受试者内的差异,因此不受受试者间变异的影响。
With paired data we treated the differences between paired observations as a single sample. The standard error of the mean difference, which was used for both the confidence interval and paired $t$ test, was based on the differences within each subject, and was thus unaffected by the variability between subjects.

对于独立组的观测值,我们仍然关注组间均值的差异,但个体间的变异性变得重要。置信区间和两样本 $t$ 检验都基于假设:每组观测值均来自正态分布的总体,且两个总体的方差相等。正态性假设是熟悉的,处理方式与之前相同。方差相等的假设此前未曾涉及,我将在后文展示如何正式检验此假设,并讨论当样本方差不相似时的应对方法。
With independent groups of observations we are again interested in the mean difference between the groups, but the variability between subjects becomes important. Both the confidence interval and the two sample $t$ test are based on the assumption that each set of observations is sampled from a population with a Normal distribution, and that the variances of the two populations are the same. The assumption of Normality is familiar, and is dealt with in the same way as previously. The assumption of equal variances has not been met before. I shall show later how to examine this assumption formally, and discuss what to do when the sample variances are not similar.

9.6.1 均值差异的置信区间 9.6.1 Confidence interval for difference between means

单组观测均值的标准误来源于数据的标准差,进而来自方差。对于两个样本,我们关注的是两个均值差异的方差。可以证明,我们所需的标准误基于两个方差的加权平均,其中较大样本的权重更高。
The standard error of the mean of one group of observations is derived from the standard deviation of the data and hence from the variance. With two samples we are interested in the variance of the difference between the two means. It can be shown that the standard error we need is based on the average of the two variances, but giving more weight to the larger sample.

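所需的标准误计算公式比单样本情况更复杂,但仅涉及每组的均值、方差和样本量。首先计算合并方差 $s^2$,公式为:
The required standard error is obtained from a more complicated formula than for the one sample case, but it involves only the mean, variance and sample size for each group. First we calculate the pooled variance, $s^2$, as

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

其中,$s_1$ 和 $s_2$ 分别为两个样本组的标准差,样本量分别为 $n_1$ 和 $n_2$。用 $\bar{x}_1$ 和 $\bar{x}_2$ 表示两个样本的均值,$s$ 为合并标准差,则有:
where $s_1$ and $s_2$ are the standard deviations of the two groups of sizes $n_1$ and $n_2$. Using $\bar{x}_1$ and $\bar{x}_2$ to denote the means of the two samples, and $s$ as the pooled standard deviation, we have

$$\operatorname{se}(\bar{x}_1 - \bar{x}_2) = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

每个组对 $s$ 的自由度贡献为 $n_i - 1$,合计自由度为 $n_1 + n_2 - 2$。获得均值差异的标准误后,我们可以构建置信区间。均值差异的 95% 置信区间为:
Each group contributes $n_i - 1$ to the degrees of freedom associated with $s$, to give $n_1 + n_2 - 2$ degrees of freedom. Having acquired the standard error of the difference between the means we can produce a confidence interval. The 95% confidence interval for the difference between the means is given by

$$\bar{x}_1 - \bar{x}_2 - t_{0.975} \times \operatorname{se}(\bar{x}_1 - \bar{x}_2) \quad \text{to} \quad \bar{x}_1 - \bar{x}_2 + t_{0.975} \times \operatorname{se}(\bar{x}_1 - \bar{x}_2)$$

其中 $t_{0.975}$ 值对应自由度为 $n_1 + n_2 - 2$。
where the value of $t_{0.975}$ has $n_1 + n_2 - 2$ degrees of freedom.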

表9.4 24小时总能量消耗(MJ/天)在瘦女性组和肥胖女性组中的数据(Prentice等,1986)
Table 9.4 24 hour total energy expenditure (MJ/day) in groups of lean and obese women (Prentice et al., 1986)

| | 瘦组 Lean (n = 13) | 肥胖组 Obese (n = 9) |
|---|---|---|
| | 6.13 | 8.79 |
| | 7.05 | 9.19 |
| | 7.48 | 9.21 |
| | 7.48 | 9.68 |
| | 7.53 | 9.69 |
| | 7.58 | 9.97 |
| | 7.90 | 11.51 |
| | 8.08 | 11.85 |
| | 8.09 | 12.79 |
| | 8.11 | |
| | 8.40 | |
| | 10.15 | |
| | 10.88 | |
| 均值 Mean | 8.066 | 10.298 |
| 标准差 SD | 1.238 | 1.398 |

表9.4显示了瘦女性组和肥胖女性组的24小时能量消耗。肥胖组的平均能量消耗为10.3 MJ/天,高于瘦组的8.1 MJ/天,且两组的标准差非常接近。合并标准差为
Table 9.4 shows the 24 hour energy expenditure of groups of lean and obese women. The obese group had a higher mean energy expenditure of 10.3 compared with 8.1 MJ/day for the lean group and the two standard deviations were very similar. The pooled standard deviation is

$$s = \sqrt{\frac{12 \times 1.238^2 + 8 \times 1.398^2}{13 + 9 - 2}} = 1.3044 \text{ MJ/day}$$

平均能量消耗差异的标准误为
The standard error of the difference in mean energy expenditure is given by

$$\operatorname{se}(\bar{x}_1 - \bar{x}_2) = 1.3044 \times \sqrt{\frac{1}{13} + \frac{1}{9}} = 0.5656 \text{ MJ/day}$$

两组平均能量消耗的差异为2.232 MJ/天。为了构建平均差异的95%置信区间,我们需要20自由度下的 $t_{0.975}$ 值,表B4显示该值为2.086。因此,肥胖组与瘦组24小时能量消耗平均差异的95%置信区间为
The difference in the mean energy expenditure of the two groups was 2.232 MJ/day. To construct the 95% confidence interval for the mean difference we need the value of $t_{0.975}$ on 20 degrees of freedom, which Table B4 shows is 2.086. The 95% confidence interval for the mean difference in 24 hour energy expenditure between obese and lean women is thus

$$2.232 - 2.086 \times 0.5656 \quad \text{to} \quad 2.232 + 2.086 \times 0.5656$$

或者 1.05 到 3.41 兆焦/天。
or 1.05 to 3.41 MJ/day.

9.6.2 两独立样本 t 检验 9.6.2 Two sample t test

还有一种适用于比较两个独立数据组的 $t$ 检验。两独立样本 $t$ 检验与单样本或配对 $t$ 检验非常相似,$t$ 统计量由下式计算:
There is also a $t$ test appropriate for comparing two independent groups of data. The two sample $t$ test looks much the same as the single sample or paired $t$ tests, the $t$ statistic being obtained from

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\operatorname{se}(\bar{x}_1 - \bar{x}_2)}$$

并与自由度为 $n_1 + n_2 - 2$ 的 $t$ 分布进行比较。我们已经计算出均值差的标准误为 0.5656 兆焦/天,因此 $t = 2.232/0.5656 = 3.95$,自由度为 20,得到 $P < 0.001$。可以说,肥胖女性的总能量消耗显著高于瘦女性。
and compared with the $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom. We have already calculated the standard error of the difference in the means as 0.5656 MJ/day, so we have $t = 2.232/0.5656 = 3.95$ on 20 degrees of freedom, giving $P < 0.001$. We can say that the total energy expenditure in the obese women was highly significantly greater than that of the lean women.

几乎所有统计软件包都包含两独立样本 $t$ 检验,但遗憾的是,如果你已经计算了均值和标准差,很少有软件能直接进行计算。因此,如果你想用已发表论文中的汇总统计量计算置信区间或 $t$ 检验,可能需要手工计算,使用前一节中给出的公式。
Virtually all statistical computer packages include the two sample $t$ test, but unfortunately very few will do the calculations if you have already calculated the mean and standard deviation. Thus if you wish to calculate a confidence interval or $t$ test using summary statistics from a published paper you will probably have to perform the calculations by hand, using the equations given in the previous section.
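Working from published summary statistics needs only a few lines of code. The sketch below wraps the pooled-variance calculation in a small function; the name `two_sample_summary` and the dictionary it returns are illustrative choices, and the critical value must still be supplied from a table such as Table B4.

```python
import math

def two_sample_summary(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """Pooled-variance comparison of two independent groups from summary data."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    pooled_sd = math.sqrt(pooled_var)
    # Standard error of the difference between the two means
    se = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return {
        "df": df,
        "pooled_sd": pooled_sd,
        "se": se,
        "diff": diff,
        "ci": (diff - t_crit * se, diff + t_crit * se),
        "t": diff / se,
    }

# Table 9.4: obese (n = 9) versus lean (n = 13) women, MJ/day;
# 2.086 is the two-sided 5% point of t on 20 degrees of freedom
result = two_sample_summary(10.298, 1.398, 9, 8.066, 1.238, 13, t_crit=2.086)

print(f"pooled SD = {result['pooled_sd']:.4f}, SE = {result['se']:.4f}")
print(f"95% CI = {result['ci'][0]:.2f} to {result['ci'][1]:.2f} MJ/day")
print(f"t = {result['t']:.2f} on {result['df']} degrees of freedom")
```

The function assumes equal population variances, as the method in the text does, so it should not be used when the two sample standard deviations are very different.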

9.6.3 中位数差的置信区间 9.6.3 Confidence interval for difference between medians

有一种非参数方法可构建两组观测中位数差的置信区间。该方法要求样本来自形状相同、仅位置不同的分布(因此也是两均值差的非参数置信区间)。此方法使用不广泛,操作较复杂,故此处不详述。该方法由 Campbell 和 Gardner(1989)描述。
There is a non-parametric method for constructing a confidence interval for the difference between the medians of two groups of observations. It requires the restrictive assumption that the samples are from populations with distributions that are identical in shape, and differ only by a shift in location. (It is thus also a non-parametric confidence interval for the difference between two means.) This method is not widely used and is rather complicated to carry out, so details are not given here. The method is described by Campbell and Gardner (1989).

9.6.4 两组非参数比较—Mann-Whitney 检验 9.6.4 Non-parametric comparison of two groups - the Mann-Whitney test

对于比较两个独立组的数据,$t$ 检验有一种非参数的替代方法。该检验有两个推导版本,一个由Wilcoxon提出,另一个由Mann和Whitney提出。为了避免与Wilcoxon提出的配对检验混淆,最好称该方法为Mann-Whitney检验,尽管有些人称其为Mann-Whitney-Wilcoxon检验。
There is a non-parametric alternative to the $t$ test for comparing data from two independent groups. There are two derivations of the test, one due to Wilcoxon and the other to Mann and Whitney. It is better to call the method the Mann-Whitney test to avoid confusion with the paired test also due to Wilcoxon, although some people refer to the test as the Mann-Whitney-Wilcoxon test.

The Mann-Whitney test requires all the observations to be ranked as if they were from a single sample. Then the sum of the ranks in one group is calculated and a $P$ value found from Table B10. Table 9.5 shows the energy expenditure data treated in this way.

Table 9.5 Calculations for the Mann-Whitney test on energy expenditure (EE) data (MJ/day) in Table 9.4

| Rank (lean) | EE, lean (n = 13) | EE, obese (n = 9) | Rank (obese) |
|---|---|---|---|
| 1 | 6.13 | | |
| 2 | 7.05 | | |
| 3.5 | 7.48 | | |
| 3.5 | 7.48 | | |
| 5 | 7.53 | | |
| 6 | 7.58 | | |
| 7 | 7.90 | | |
| 8 | 8.08 | | |
| 9 | 8.09 | | |
| 10 | 8.11 | | |
| 11 | 8.40 | | |
| | | 8.79 | 12 |
| | | 9.19 | 13 |
| | | 9.21 | 14 |
| | | 9.68 | 15 |
| | | 9.69 | 16 |
| | | 9.97 | 17 |
| 18 | 10.15 | | |
| 19 | 10.88 | | |
| | | 11.51 | 20 |
| | | 11.85 | 21 |
| | | 12.79 | 22 |
| Sum = 103 | | | Sum = 150 |

The sums of the ranks in the two groups are 103 and 150. (We can check our calculations by noting that the sum of all ranks of $N$ observations must be $N(N+1)/2$, which here is 253.) We can now use two alternative statistics, $T$ and $U$. The statistic $T$ (due to Wilcoxon) is simply the sum of the ranks in the smaller group, 150 in our example. (Either group can be taken if they are of the same size.) The statistic $U$ (due to Mann and Whitney) is more complicated, being calculated as

$$U = n_1 n_2 + \tfrac{1}{2} n_1 (n_1 + 1) - T,$$

where $n_1$ is the size of the group whose rank sum is $T$ and $n_2$ is the size of the other group.

The advantage of using $U$ is that it is one of the few non-parametric statistics that has a useful interpretation. $U$ is the number of all possible pairs of observations comprising one from each sample, say $x_i$ and $y_j$, for which $x_i < y_j$. Thus if the sample sizes are $n_1$ and $n_2$ then $U / n_1 n_2$ is the proportion of all such pairs, and so is also the estimated probability that a new observation from the first population will be less than a new observation sampled from the second population. For analysis by computer the Mann-Whitney statistic is thus preferable because of its interpretation, but for hand calculation the Wilcoxon statistic is much easier to obtain.
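The ranking and the two statistics can be checked with a few lines of code. This is a sketch under stated assumptions: the helper `average_ranks` is a hypothetical name, tied values receive the mean of the ranks they span, and $U$ is computed with $n_1$ taken as the size of the group whose rank sum is used as $T$ (here the obese group).

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Energy expenditure (MJ/day) from Table 9.5
lean = [6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88]
obese = [8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79]

ranks = average_ranks(lean + obese)
rank_sum_lean = sum(ranks[:len(lean)])
rank_sum_obese = sum(ranks[len(lean):])

n1, n2 = len(obese), len(lean)   # n1 is the smaller group
T = rank_sum_obese               # Wilcoxon statistic: rank sum of the smaller group
U = n1 * n2 + n1 * (n1 + 1) / 2 - T

# Direct count of (obese, lean) pairs in which the obese value is the smaller one
pairs_below = sum(1 for x in obese for y in lean if x < y)

print(rank_sum_lean, T, U, pairs_below)
```

The direct count of pairs agrees with $U$ here because no value is tied between the two groups.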

For small samples it is possible to evaluate the observed value of the test statistic by considering the distribution of all the possible sums of ranks with samples of size $n_1$ and $n_2$. To take a simple example, if we have samples of sizes 2 and 5, there are only a small number of possible orderings of the seven observations. The ranks of the two values in the smaller group must be one of the following 21 combinations:

$$
\begin{array}{llllll}
1,2 & 1,3 & 1,4 & 1,5 & 1,6 & 1,7 \\
2,3 & 2,4 & 2,5 & 2,6 & 2,7 & \\
3,4 & 3,5 & 3,6 & 3,7 & & \\
4,5 & 4,6 & 4,7 & & & \\
5,6 & 5,7 & & & & \\
6,7 & & & & &
\end{array}
$$

Each combination yields a sum of ranks as follows:

$$
\begin{array}{rrrrrr}
3 & 4 & 5 & 6 & 7 & 8 \\
5 & 6 & 7 & 8 & 9 & \\
7 & 8 & 9 & 10 & & \\
9 & 10 & 11 & & & \\
11 & 12 & & & & \\
13 & & & & &
\end{array}
$$

If the null hypothesis is true, any one of these possibilities is equally likely because there is no difference between the groups. For any pair of sample sizes the same procedure can be used to get the distribution of possible rank sums, from which the probability of obtaining any particular rank sum (or a more extreme one) can be calculated. Thus we can calculate the range of values of the rank sum that is compatible with the null hypothesis at any level of significance. The $P$ values thus obtained are known as exact probabilities. In the above example, an observed rank sum of 5 would correspond to an exact one-sided $P$ value of $4/21$, or 0.19, so that the two-sided $P$ value is 0.38.
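The enumeration described above is easy to reproduce for small samples. The sketch below is illustrative only: `exact_lower_tail` is a hypothetical helper that lists every possible set of ranks for the smaller group and counts those whose sum is no larger than the observed rank sum.

```python
from fractions import Fraction
from itertools import combinations


def exact_lower_tail(n_small, n_large, observed_sum):
    """Exact one-sided probability of a rank sum <= observed_sum under the null hypothesis."""
    total = n_small + n_large
    # Every equally likely choice of ranks for the smaller group
    sums = [sum(c) for c in combinations(range(1, total + 1), n_small)]
    as_small = sum(1 for s in sums if s <= observed_sum)
    return Fraction(as_small, len(sums)), len(sums)


p_one_sided, n_combinations = exact_lower_tail(2, 5, 5)
print(n_combinations, p_one_sided, float(2 * p_one_sided))
```

The number of combinations grows quickly with the sample sizes, which is why tables or a large sample approximation are used in practice.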

Table B10 gives these critical values of the statistic $T$, showing that with sample sizes of 9 and 13 the rank sum of 150 is outside the $P = 0.01$ range of expected rank sums under the null hypothesis but not outside the $P = 0.001$ range, so we write $P < 0.01$.

For larger samples of about ten or more in each group the statistic $T$ has an approximately Normal distribution with mean $\mu_T = n_1 (n_1 + n_2 + 1)/2$ and standard deviation $\sigma_T = \sqrt{n_2 \mu_T / 6}$, where $n_1$ and $n_2$ are the sample sizes in the smaller and larger group respectively. From these we can calculate the test statistic $z$ as $(T - \mu_T)/\sigma_T$ and refer to tables of the Normal distribution (Table B2).

It is reasonable to use the large sample approximation for the above example with sample sizes 9 and 13. The mean and standard deviation of the test statistic under the null hypothesis are given by

$$\mu_T = \frac{9 \times (9 + 13 + 1)}{2} = 103.5$$

and

$$\sigma_T = \sqrt{\frac{13 \times 103.5}{6}} = 14.975,$$

giving

$$z = \frac{150 - 103.5}{14.975} = 3.11,$$

which, from Table B2, corresponds to $P = 0.002$. There is an equivalent large sample approximation for the statistic $U$; details are given by Bland (1987, p. 223).
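The large sample calculation can be sketched as follows. The function name is hypothetical, and the two-sided $P$ value is obtained from the standard Normal distribution via `math.erfc` rather than from Table B2; no correction for ties or continuity is applied.

```python
import math


def mann_whitney_normal_approx(T, n_small, n_large):
    """Large sample Normal approximation for the rank sum T of the smaller group."""
    mean_T = n_small * (n_small + n_large + 1) / 2
    sd_T = math.sqrt(n_large * mean_T / 6)
    z = (T - mean_T) / sd_T
    # Two-sided tail area of the standard Normal distribution
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return mean_T, sd_T, z, p_two_sided


mean_T, sd_T, z, p_two_sided = mann_whitney_normal_approx(150, 9, 13)
print(f"mean = {mean_T}, SD = {sd_T:.3f}, z = {z:.2f}, P = {p_two_sided:.3f}")
```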

The Mann-Whitney test as described is based on the assumption that there are no tied ranks. If there are many identical data values, complicated corrections should be applied to the large sample formula. Computer packages ought automatically to adjust for tied ranks, but not all do.

Non-parametric methods in computer programs may use the large sample Normal approximation, even for small samples. For small samples it is advisable to check the calculated statistic (if given) against the appropriate table. However, it is not always clear which statistic is given. For example, in Minitab (release 6.1) $T$ is calculated for the first sample (not necessarily the smaller sample) but it is called $W$.

9.6.5 Unequal variances

Sometimes we wish to compare two groups of observations where the assumption of Normality is reasonable, but the variability in the two groups is markedly different. Two questions arise: how different do the variances have to be before we should not use the two sample $t$ test, and what can we do if this happens?

The $t$ test is known to be 'robust' in that it is little affected by moderate failure to meet the assumptions. It is not possible to say how different the variances in the two groups can be before we cannot use the $t$ test. However, the $t$ test is based on the assumption that the two population variances are the same, so we can test the null hypothesis that this is so, using the $F$ test.

The $F$ test or variance ratio test is very simple. Under the null hypothesis that two Normally distributed populations have equal variances we expect the ratio of the two sample variances to have a sampling distribution known as the $F$ distribution. The variance ratio is the ratio of the sample variances, or equivalently the square of the ratio of the sample standard deviations. We calculate the variance ratio observed in our sample by taking the larger standard deviation divided by the smaller, and look up the square of this value in Table B6. The distribution of the $F$ statistic has two values of degrees of freedom, one corresponding to each variance.

Table 9.6 shows serum thyroxine measurements from 16 infants diagnosed as hypothyroid. We wish to compare thyroxine levels in two groups defined by severity of symptoms, but the standard deviations are markedly different. The ratio of variances is $(37.48/14.22)^2 = 6.95$. We use Table B6 to compare 6.95 with the $F$ distribution with 6 and 8 degrees of freedom, the first value relating to the numerator (37.48) and the second to the denominator (14.22), and both being one less than the number of observations. Because we take the ratio of the larger variance to the smaller we consider only the upper tail of the $F$ distribution. We get $P < 0.01$, so it is unlikely that the two samples come from populations with the same variance.
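The variance ratio and its two degrees of freedom can be computed directly from the data in Table 9.6. This is a minimal sketch with a hypothetical helper name; it returns only the $F$ statistic and the degrees of freedom, leaving the $P$ value to be read from Table B6.

```python
import statistics

# Serum thyroxine measurements from Table 9.6
slight = [34, 45, 49, 55, 58, 59, 60, 62, 86]   # slight or no symptoms
marked = [5, 8, 18, 24, 60, 84, 96]             # marked symptoms


def variance_ratio(sample_a, sample_b):
    """Ratio of the larger sample variance to the smaller, with (numerator, denominator) df."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


F, df_num, df_den = variance_ratio(slight, marked)
print(f"F = {F:.2f} on {df_num} and {df_den} degrees of freedom")
```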

We should not now use the two sample $t$ test to compare the two means. We could instead use the Mann-Whitney test, but we could also use a modification of the $t$ test for the case with unequal variances, known as the Welch test, which is not covered in this book (see Armitage and Berry, 1987, p. 110). If, however, the samples are large we can use the large sample Normal distribution methods described in section 8.4, for which there is no requirement that the groups have the same variance.

Table 9.6 Serum thyroxine level (nmol/l) in 16 hypothyroid infants by severity of symptoms (Hulse et al., 1979)

| | Slight or no symptoms (n = 9) | Marked symptoms (n = 7) |
|---|---|---|
| | 34 | 5 |
| | 45 | 8 |
| | 49 | 18 |
| | 55 | 24 |
| | 58 | 60 |
| | 59 | 84 |
| | 60 | 96 |
| | 62 | |
| | 86 | |
| Mean | 56.4 | 42.1 |
| SD | 14.22 | 37.48 |

9.7 ANALYSIS OF SKEWED DATA

The use of the $t$ test is based on the assumption that the data for each group (with independent samples) or the differences (with paired samples) have an approximately Normal distribution, and for the two sample case we also require the two groups to have similar variances. We sometimes find that at least one requirement is not met. When the data are skewed we can either use a non-parametric method, or try a transformation of the raw data.

The most useful transformation is the logarithmic transformation. It has the special property that it is possible to get a confidence interval for the difference between the groups that relates to the original data. No other transformation has this property. Fortunately, taking logs is very often successful in removing skewness and also in making variances more equal.

I shall illustrate the paired samples analysis using data from a study of lymphocyte abnormalities in patients in remission from Hodgkin's disease or diverse, disseminated malignancies (called the non-Hodgkin's disease group). There were 20 patients in each group, but no pairing between the groups. Table 9.7 shows the numbers of T4 and T8 cells per mm³ in their blood. As well as the actual levels of T4 and T8 cells, the authors were particularly interested in the ratio of the numbers of T4 cells (helper cells) to T8 cells (suppressor cells), so the data are tabulated in ascending order of the ratio within each group. Table 9.7 also shows the mean and standard deviation of each group of observations. The standard deviations are all greater than half the mean, strongly suggesting (for variables where negative values are impossible) that the data are skewed. Also the standard deviations are larger for the larger means, which suggests that a log transformation may be successful both in removing skewness and making the variability more similar.

Table 9.7 Numbers of T4 and T8 cells/mm³ in blood samples from 20 patients in remission from Hodgkin's disease and 20 patients in remission from disseminated malignancies (Shapiro et al., 1986)

| | Hodgkin's disease: T4 | Hodgkin's disease: T8 | Non-Hodgkin's disease: T4 | Non-Hodgkin's disease: T8 |
|---|---|---|---|---|
| | 396 | 836 | 375 | 340 |
| | 568 | 978 | 375 | 330 |
| | 1212 | 1678 | 752 | 627 |
| | 171 | 212 | 208 | 153 |
| | 554 | 670 | 151 | 101 |
| | 1104 | 1335 | 116 | 72 |
| | 257 | 272 | 736 | 449 |
| | 435 | 446 | 192 | 108 |
| | 295 | 262 | 315 | 177 |
| | 397 | 340 | 1252 | 575 |
| | 288 | 236 | 675 | 318 |
| | 1004 | 786 | 700 | 320 |
| | 431 | 311 | 440 | 200 |
| | 795 | 449 | 771 | 289 |
| | 1621 | 811 | 688 | 263 |
| | 1378 | 686 | 426 | 157 |
| | 902 | 412 | 410 | 140 |
| | 958 | 286 | 979 | 310 |
| | 1283 | 336 | 377 | 108 |
| | 2415 | 936 | 503 | 163 |
| Mean | 823.2 | 613.9 | 522.1 | 260.0 |
| SD | 566.4 | 397.9 | 293.0 | 154.7 |

Figure 9.2 Histograms of T4 and T8 (cells/mm³) in 20 patients in remission from Hodgkin's disease and 20 patients in remission from disseminated malignancies (non-Hodgkin's disease).

Figure 9.3 Histograms of log T4 and log T8.

Figure 9.2 shows histograms of the raw data, clearly showing the skewness and unequal scatter. Figure 9.3 shows the success of the log transformation in producing data that are plausibly Normal and have similar standard deviations. Some of these data were shown graphically in Figure 7.1. Figure 9.4 shows that log transformation has also made the differences between T4 and T8 more Normal, especially in the non-Hodgkin's disease group.

Figure 9.4 Histograms of (a) T4 - T8 (raw data) and (b) log T4 - log T8 (log data) in the Hodgkin's disease (n = 20) and non-Hodgkin's disease (n = 20) groups, with the mean and SD of each set of differences.

9.7.1 Parametric analysis

(a) Confidence interval

We can use the paired $t$ test to compare the logs of the numbers of T4 and T8 cells in the Hodgkin's disease group and calculate a confidence interval, using the methods given earlier. From Figure 9.4 the mean and standard deviation of the differences between the log T4 and log T8 counts are 0.25 and 0.569, so the standard error of the mean is $0.569/\sqrt{20} = 0.1272$. The value of $t_{0.975}$ on 19 degrees of freedom is 2.093, so the 95% confidence interval for the mean difference between log T4 and log T8 cell counts in patients with Hodgkin's disease is given by

$$0.25 - 2.093 \times 0.1272 \quad \text{to} \quad 0.25 + 2.093 \times 0.1272$$

or $-0.016$ to 0.516.

This confidence interval is for log T4 - log T8, but we are usually more interested in a confidence interval relating to the scale of the original data. We can do this because the difference between the logarithms of two values is exactly the same as the logarithm of their ratio, i.e. $\log \text{T4} - \log \text{T8} = \log(\text{T4}/\text{T8})$. It follows that the antilog of the mean of the log differences will be an estimate of the geometric mean of the ratio of the variables. The mean value of log T4 - log T8 was 0.25, so that the geometric mean of T4/T8 is given by $e^{0.25} = 1.28$. Further, we can 'back-transform' our confidence interval for the mean log difference to get a confidence interval for the geometric mean of the ratio T4/T8. The 95% confidence interval becomes $e^{-0.016}$ to $e^{0.516}$, or 0.98 to 1.67. Thus we can be 95% sure that on average the ratio of T4 to T8 blood cell counts in patients in remission from Hodgkin's disease is between 0.98 and 1.67, with 1.28 as our best estimate.
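The whole calculation, from the raw counts to the back-transformed interval, can be sketched in a few lines. The data are the Hodgkin's disease columns of Table 9.7, natural logarithms are assumed (consistent with the antilog $e^{0.25}$ used above), and the critical value 2.093 is simply the tabulated $t$ value quoted in the text rather than something the code derives.

```python
import math
import statistics

# T4 and T8 counts (cells/mm3) for the Hodgkin's disease group, Table 9.7
t4 = [396, 568, 1212, 171, 554, 1104, 257, 435, 295, 397,
      288, 1004, 431, 795, 1621, 1378, 902, 958, 1283, 2415]
t8 = [836, 978, 1678, 212, 670, 1335, 272, 446, 262, 340,
      236, 786, 311, 449, 811, 686, 412, 286, 336, 936]

# Paired differences on the natural log scale
log_diffs = [math.log(a) - math.log(b) for a, b in zip(t4, t8)]
mean_diff = statistics.mean(log_diffs)
sd_diff = statistics.stdev(log_diffs)
se_mean = sd_diff / math.sqrt(len(log_diffs))

T_CRIT = 2.093  # t value on 19 degrees of freedom for a 95% interval, taken from the text
lower = mean_diff - T_CRIT * se_mean
upper = mean_diff + T_CRIT * se_mean

# Back-transform to the ratio scale
geo_mean_ratio = math.exp(mean_diff)
ci_ratio = (math.exp(lower), math.exp(upper))

print(f"mean log difference = {mean_diff:.3f}, SD = {sd_diff:.3f}")
print(f"geometric mean ratio = {geo_mean_ratio:.2f}, "
      f"95% CI {ci_ratio[0]:.2f} to {ci_ratio[1]:.2f}")
```

Small differences in the last decimal place relative to the hand calculation are to be expected, because the hand calculation starts from rounded values.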

It is very reasonable to express results for skewed data in terms of ratios. Indeed, it was the ratio T4/T8 that the researchers (Shapiro et al., 1986) were interested in. Although not in the original units, the back-transformed confidence interval for the ratio is directly related to the original data in an easily interpretable way. No transformation of data other than taking logs allows back-transformation in this way. Confidence intervals in transformed units are not easily interpretable, so it is a major disadvantage of other transformations, such as taking square roots, that it is not possible to obtain meaningful confidence intervals.

(b) Paired $t$ test

The paired $t$ test of the log T4 and log T8 data gives $t = 0.25/0.1272 = 1.97$, for which we have $P = 0.06$. The data thus suggest that T8 cell counts are lower than T4 among patients in remission from Hodgkin's disease, although the difference is not quite significant at the 5% level.

(c) Comment

A similar approach is used for comparing independent groups. For example, cell counts in the two groups of patients are compared using the confidence interval and two sample $t$ test described in sections 9.6.1 and 9.6.2. The principle of analysing skewed data by taking logs applies equally to more complex analyses described later in this chapter and in subsequent chapters. It will not be illustrated for each method.

9.7.2 Non-parametric analysis

The non-parametric equivalent of the paired $t$ test is the Wilcoxon matched pairs signed rank sum test, which we can use to perform a non-parametric analysis of the raw T4 and T8 data given in Table 9.7. This test is identical to the one-sample Wilcoxon signed rank sum test described in section 9.4.5, where we treat the differences between the paired values as our sample for calculating the ranks. The calculations are shown in Table 9.8. We can look up either the sum of the ranks of negative differences (63) or positive differences (147) in Table B9, giving $P > 0.10$.

Table 9.8 Calculations for Wilcoxon matched pairs signed rank sum test to compare T4 and T8 cell counts in the Hodgkin's disease group

| Difference T4 - T8 (cells/mm³) | Absolute difference | Rank |
|---|---|---|
| -440 | 440 | 13 |
| -410 | 410 | 12 |
| -466 | 466 | 14 |
| -41 | 41 | 4 |
| -116 | 116 | 7 |
| -231 | 231 | 10 |
| -15 | 15 | 2 |
| -11 | 11 | 1 |
| 33 | 33 | 3 |
| 57 | 57 | 6 |
| 52 | 52 | 5 |
| 218 | 218 | 9 |
| 120 | 120 | 8 |
| 346 | 346 | 11 |
| 810 | 810 | 18 |
| 692 | 692 | 17 |
| 490 | 490 | 15 |
| 1479 | 1479 | 20 |
| 672 | 672 | 16 |
| 947 | 947 | 19 |

Sum of ranks of negative differences = 63. Sum of ranks of positive differences = 147.

Table 9.9 Calculations for Wilcoxon matched pairs test to compare log T4 and log T8

| Difference log T4 - log T8 | Absolute difference | Rank | Rank of raw data |
|---|---|---|---|
| -0.747 | 0.747 | 16 | 13 |
| -0.543 | 0.543 | 12 | 12 |
| -0.325 | 0.325 | 10 | 14 |
| -0.215 | 0.215 | 8 | 4 |
| -0.190 | 0.190 | 5.5 | 7 |
| -0.190 | 0.190 | 5.5 | 10 |
| -0.057 | 0.057 | 2 | 2 |
| -0.025 | 0.025 | 1 | 1 |
| 0.119 | 0.119 | 3 | 3 |
| 0.155 | 0.155 | 4 | 6 |
| 0.199 | 0.199 | 7 | 5 |
| 0.245 | 0.245 | 9 | 9 |
| 0.326 | 0.326 | 11 | 8 |
| 0.571 | 0.571 | 13 | 11 |
| 0.693 | 0.693 | 14 | 18 |
| 0.698 | 0.698 | 15 | 17 |
| 0.784 | 0.784 | 17 | 15 |
| 0.948 | 0.948 | 18 | 20 |
| 1.209 | 1.209 | 19 | 16 |
| 1.340 | 1.340 | 20 | 19 |

Sum of ranks of negative differences = 60. Sum of ranks of positive differences = 150.

It is a peculiarity of the Wilcoxon matched pairs test that, alone among the commonly used non-parametric methods, the result can be affected by transforming the data. If we take logs of T4 and T8 before calculating the ranks of the differences we may get a different result. Because of this possibility some statisticians reject this test in favour of the sign test, while others suggest that the data should be transformed if the raw data show larger differences for larger data values, as shown by a plot of the difference T4 - T8 against the mean of T4 and T8. In fact, as noted for the single sample test (section 9.4.5), the method is based on the assumption that the differences have a symmetric distribution. A transformation should therefore be used only if it makes the distribution of the differences more symmetric.

Figure 9.4 showed that the distributions of the differences T4 - T8 are skewed, while those for log T4 - log T8 are more symmetric. Table 9.9 shows the differences between the log values and the Wilcoxon test applied to log T4 and log T8. The rank sums are slightly different from those obtained before, but the $P$ value of 0.10 is a similar result in this case. However, comparison of the ranks obtained from the log data and from the raw data (in Table 9.9) shows that there are some quite substantial differences in the rankings for individual patients. Because the Wilcoxon test is based on the assumption of symmetry of the differences between pairs of observations, it is preferable for these data to use the log transformation.
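The effect of the transformation on the signed rank sums can be reproduced directly from the Hodgkin's disease data in Table 9.7. This is a sketch only: the helper names are hypothetical, zero differences are dropped, tied absolute differences share the mean of their ranks, and the log differences are rounded to three decimal places as an assumption made so that the ranking matches the tabulated values in Table 9.9.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def signed_rank_sums(differences):
    """Sums of the ranks of the negative and of the positive differences."""
    nonzero = [d for d in differences if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    negative = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    positive = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    return negative, positive


t4 = [396, 568, 1212, 171, 554, 1104, 257, 435, 295, 397,
      288, 1004, 431, 795, 1621, 1378, 902, 958, 1283, 2415]
t8 = [836, 978, 1678, 212, 670, 1335, 272, 446, 262, 340,
      236, 786, 311, 449, 811, 686, 412, 286, 336, 936]

raw_sums = signed_rank_sums([a - b for a, b in zip(t4, t8)])
log_sums = signed_rank_sums([round(math.log(a) - math.log(b), 3) for a, b in zip(t4, t8)])

print("raw data:", raw_sums)
print("log data:", log_sums)
```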

The use of transformations in conjunction with non-parametric methods is unappealing to some people, who feel that the sign test is probably the preferable non-parametric method for analysing paired data. Not all non-parametric or distribution-free methods are completely free of assumptions about the distribution, however, and the Wilcoxon paired test is preferable to the sign test when the assumption of symmetry is plausible.

9.8 THREE OR MORE INDEPENDENT GROUPS OF OBSERVATIONS

Most of this chapter so far has related to the analysis of two sets of observations, either paired data for a single sample of individuals or data from two independent samples. These ideas extend to situations where we have three or more sets of observations, either from a single sample or from independent samples. In this section I shall consider only independent groups. The case where several measurements are taken on each individual in a single sample is considered in Chapter 12.

With several groups of observations it is obviously possible to compare each pair of groups using $t$ tests, but this is not a good approach. It is far better to use a single analysis that enables us to look at all the data in one go, and the method we use is called one way analysis of variance (sometimes abbreviated to anova). The methods introduced in this section, both parametric and non-parametric, can all be used when there are only two groups of observations and will give identical results to the two sample methods already described. The two sample $t$ test is, for example, a special case of one way analysis of variance. As its name implies, one way analysis of variance is the simplest type, which is used when there is a single way of classifying individuals. When there are two factors classifying the observations we need two way analysis of variance, and so on. Some more complicated analyses are described in section 12.3.

The analyses covered in this section are better done by computer, as we are reaching the limit of what is practicable for 'hand calculation'. Although the formulae will be given for the methods described in this section and in some subsequent chapters, the mathematical details will mostly be in separate sections, and I shall assume that a computer is used for the main analysis.

9.8.1 One way analysis of variance

The principle behind analysis of variance is to partition the total variability of a set of data into components due to different sources of variation. For example, the variability of the energy expenditure data in Table 9.4 could be partitioned into that due to variation between individuals within each group, and that due to any systematic difference between the groups. Indeed, because our null hypothesis is that there is no difference between the groups, the test is based on a comparison of the observed variation between the groups (i.e. between their means) with that expected from the observed variability between subjects. The comparison takes the general form of an $F$ test to compare variances, but for two groups the $t$ test leads to exactly the same answer. The samples do not all have to be the same size.

Analysis of variance should preferably be performed using a statistical computer package, but the method of calculation is given in section 9.9. A statistical package will produce the numerical results, but it is important to understand the principles involved.

1. The analysis is based on the assumption that the samples come from Normally distributed populations with the same standard deviation (or variance). Normality and equal variance should not be assumed, but should be verified, as described in (5) below.
2. Because we assume that the samples are from populations with the same variance, the variance within each group is an estimate of the population variance. We thus pool the sample variances (in the same way as we did for the two sample $t$ test) to get an estimate of the population variance.
3. We can use the pooled estimate of variance to calculate a confidence interval for the difference between any pair of means.
4. We can perform a hypothesis test based on the null hypothesis that the samples are from populations with the same mean and variance. We can thus compare the variation among the observed sample means with what we would expect from random samples if the null hypothesis were true. In other words, we can calculate the probability of observing such variability among means of samples drawn at random from the same population. The comparison takes the form of the ratio of the variance estimated from the means of the groups (the between group variation) to the variance between the individuals within the groups. As we saw earlier, we use tables of the $F$ distribution to test the equality of two variances.
5. After carrying out the analysis of variance we should examine the variation of the individual observations around the mean of their sample. For each individual the mean of their group is the value fitted by the model, and the difference between the observed and fitted values is called a residual. It is the variance of these residuals that we use as our estimate of between subject variability, against which we evaluate the between group variance. We can construct a Normal plot of the residuals to assess the assumption of Normality. If the Normal plot is unsatisfactory, we must reanalyse the data, perhaps after transforming the data or by using a non-parametric alternative.

Another way of viewing the hypothesis test associated with analysis of variance is that we are comparing two alternative statistical models. In one the mean and standard deviation are the same in each population, while in the other the means are different (and equal to the observed sample means) but the standard deviations are again the same. The test assesses the plausibility of the first model. If the between group variability is greater than expected (with, say, $P < 0.05$) we will prefer the second model, in which the means of the groups differ.

If we have only two groups the analysis of variance is exactly equivalent to the $t$ test for two independent groups. Thus the $F$ test yields the same $P$ value as the $t$ test. The numerator in the variance ratio has just one degree of freedom and we have the relation $F_{1,\nu} = t_{\nu}^2$, as can be seen from Tables B4 and B6.
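The equivalence can be checked numerically on the two-group energy expenditure data listed in Table 9.5. The sketch below uses two hypothetical helpers, one computing the one way analysis of variance $F$ ratio and one the pooled two sample $t$ statistic, and confirms that $F$ equals $t^2$ for these data.

```python
import math
import statistics


def one_way_anova_f(groups):
    """F ratio: between groups mean square divided by within groups mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def pooled_t(a, b):
    """Two sample t statistic with a pooled variance estimate."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * statistics.variance(a)
                  + (len(b) - 1) * statistics.variance(b)) / df
    se_diff = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / se_diff


# Energy expenditure (MJ/day) from Table 9.5
lean = [6.13, 7.05, 7.48, 7.48, 7.53, 7.58, 7.90, 8.08, 8.09, 8.11, 8.40, 10.15, 10.88]
obese = [8.79, 9.19, 9.21, 9.68, 9.69, 9.97, 11.51, 11.85, 12.79]

F = one_way_anova_f([lean, obese])
t = pooled_t(obese, lean)
print(f"F = {F:.3f}, t = {t:.3f}, t squared = {t ** 2:.3f}")
```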

9.8.2 Example

Twenty-two patients undergoing cardiac bypass surgery were randomized to one of three ventilation groups:

Group I: patients received a 50% nitrous oxide and 50% oxygen mixture continuously for 24 hours;

Group II: patients received a 50% nitrous oxide and 50% oxygen mixture only during the operation;

Group III: patients received no nitrous oxide but received 35-50% oxygen for 24 hours.
表9.10显示了三组患者在通气24小时后的红细胞叶酸水平。我们希望比较这三组,并检验三组红细胞叶酸水平相同的原假设。
Table 9.10 shows red cell folate levels for the three groups after 24 hours' ventilation. We wish to compare the three groups, and test the null hypothesis that the three groups have the same red cell folate levels.

Examination of the data does not reveal any obvious outliers and the data in each group look plausible samples from a Normal distribution. These attributes are more easily seen from Figure 9.5 than Table 9.10. The standard deviation in group I is rather higher than those in the other groups, but moderate variability is not a problem, especially when the samples are small. In general, however, the assumption that the groups come from populations with the same variance is important. Bartlett's test is an extension of the $F$ test (described in section 9.6.5) for assessing the null hypothesis that more than two samples come from populations with the same variance. Some computer programs incorporate this test, although it is not very powerful (see Armitage and Berry (1987, p. 209) for details).

Table 9.10 Red cell folate levels (μg/l) in three groups of cardiac bypass patients given different levels of nitrous oxide ventilation (Amess et al., 1978)

| | Group I (n = 8) | Group II (n = 9) | Group III (n = 5) |
|---|---|---|---|
| | 243 | 206 | 241 |
| | 251 | 210 | 258 |
| | 275 | 226 | 270 |
| | 291 | 249 | 293 |
| | 347 | 255 | 328 |
| | 354 | 273 | |
| | 380 | 285 | |
| | 392 | 295 | |
| | | 309 | |
| Mean | 316.6 | 256.4 | 278.0 |
| SD | 58.7 | 37.1 | 33.8 |

Figure 9.5 Red cell folate levels in three groups of cardiac bypass patients (data from Table 9.10).

方差分析的计算见表9.11。数据集的总变异性由总平方和衡量,基于22个观测值与总体均值差的平方和。该总平方和被划分为:(a) 组内平方和,计算为每个观测值与其所属组均值差的平方和;(b) 组间平方和,基于每组均值与总体均值差的平方和。每个平方和通过除以其自由度转化为估计方差(称为均方)。按照方差自由度通常比观测数少一的原则,组间自由度为 \(3 - 1 = 2\),组内自由度为 \((8 - 1) + (9 - 1) + (5 - 1) = 19\)。如表9.11所示,平方和和自由度之和等于将数据视为单一样本时的值。
The analysis of variance calculations are shown in Table 9.11. The total variability of the data set is measured by the total sum of squares, which is based on the sum of the squares of the differences of each of the 22 observations from the overall mean. This total is partitioned into (a) the within groups sum of squares, calculated as the sum of squares of the difference between each observation and the mean of its relevant group, and (b) the between groups sum of squares, which is based on the sum of squares of the difference between the mean of each group and the overall mean. Each sum of squares is converted into an estimated variance (known as a mean square) by dividing by its degrees of freedom. Following the usual principle that the degrees of freedom for a variance are one less than the number of observations, there are \(3 - 1 = 2\) degrees of freedom between groups and \((8 - 1) + (9 - 1) + (5 - 1) = 19\) degrees of freedom within groups. As Table 9.11 shows, the sums of squares and degrees of freedom add up to the values that are obtained if we consider the data as a single sample.

表9.11 表9.10数据的方差分析表
Table 9.11 Analysis of variance table for data in Table 9.10

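| 变异来源 | 自由度 | 平方和 | 均方 | F值 | P值 |
|---|---|---|---|---|---|
| 组间 | 2 | 15515.88 | 7757.9 | 3.71 | 0.04 |
| 组内 | 19 | 39716.09 | 2090.3 | | |
| 总计 | 21 | 55231.97 | | | |

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Between groups | 2 | 15515.88 | 7757.9 | 3.71 | 0.04 |
| Within groups | 19 | 39716.09 | 2090.3 | | |
| Total | 21 | 55231.97 | | | |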

在零假设下,假定所有组的均值和方差相同,我们期望组间方差与组内方差相等,因此方差比值应为1。我们可以使用 F 分布比较方差,从而检验零假设。两个方差分别为7757.9和2090.3,其比值为3.71。换言之,组间观察到的方差是零假设成立时预期方差的3.71倍。将3.71与自由度为2和19的 F 分布(见表B6)比较,得到 \(P < 0.05\)。(更精确的值为 \(P = 0.04\),见表9.11。)
Under the null hypothesis that all the groups have the same mean and variance we expect the between groups variance and the within groups variance to be the same, so we expect the ratio of the variances to be 1. We can use the F distribution to compare the variances and so evaluate the null hypothesis. The two variances are 7757.9 and 2090.3, and their ratio is 3.71. In other words, the observed variance among the groups is 3.71 times what we would expect if the null hypothesis were true. Comparing 3.71 with the F distribution with 2 and 19 degrees of freedom given in Table B6, we find \(P < 0.05\). (A more exact value is \(P = 0.04\), as shown in Table 9.11.)
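The partition of the total sum of squares can be checked directly from the raw data in Table 9.10. The following is a minimal sketch in plain Python, not the method used to produce Table 9.11; the variable names are illustrative. The P value uses a closed form for the upper tail of the F distribution that holds only when the numerator has exactly 2 degrees of freedom, which is an assumption specific to this example. The results agree with Table 9.11 to within rounding.

```python
# One way analysis of variance for the red cell folate data (Table 9.10).
groups = {
    "I": [243, 251, 275, 291, 347, 354, 380, 392],
    "II": [206, 210, 226, 249, 255, 273, 285, 295, 309],
    "III": [241, 258, 270, 293, 328],
}

all_obs = [x for obs in groups.values() for x in obs]
n_total = len(all_obs)          # N = 22 observations
k = len(groups)                 # k = 3 groups
grand_mean = sum(all_obs) / n_total

ss_within = 0.0   # deviations of each observation from its own group mean
ss_between = 0.0  # deviations of each group mean from the overall mean
for obs in groups.values():
    mean_i = sum(obs) / len(obs)
    ss_within += sum((x - mean_i) ** 2 for x in obs)
    ss_between += len(obs) * (mean_i - grand_mean) ** 2

df_between = k - 1
df_within = n_total - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

# Upper tail of F with (2, df_within) degrees of freedom.
# This closed form is valid only for 2 numerator degrees of freedom.
p_value = (1 + 2 * f_ratio / df_within) ** (-df_within / 2)

print(f"Between groups: df={df_between}, SS={ss_between:.2f}, MS={ms_between:.1f}")
print(f"Within groups:  df={df_within}, SS={ss_within:.2f}, MS={ms_within:.1f}")
print(f"F = {f_ratio:.2f}, P = {p_value:.3f}")
```

For designs with more than three groups a statistical package (or a general F distribution routine) would be needed for the P value.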

表9.11的最后一点是,组内均方的平方根称为残差标准差 \(s_{res}\),因为它是残差的标准差。残差标准差也是组内合并标准差(类似于两样本 t 检验中计算的),可用于推导置信区间。
A last point to note from Table 9.11 is that the square root of the within groups mean square is called the residual standard deviation, \(s_{res}\), because it is the standard deviation of the residuals. The residual standard deviation is also the pooled within groups standard deviation (analogous to that calculated for the two sample t test) from which we can derive confidence intervals.

9.8.3 置信区间 9.8.3 Confidence intervals

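任何组的均值都可以用常规方法构建置信区间,只是标准误基于残差标准差。因此,若感兴趣组有 \(n_i\) 个观测值,均值为 \(\bar{y}_i\),该组均值的标准误为
A confidence interval can be constructed for the mean of any group in the usual way, except that the standard error we use is based on the residual standard deviation. Thus, if there are \(n_i\) observations in the group of interest with a mean of \(\bar{y}_i\), the standard error of the mean is given by

\[ se(\bar{y}_i) = \frac{s_{res}}{\sqrt{n_i}} . \]

95% 置信区间由下式给出
The 95% confidence interval is given by

\[ \bar{y}_i - t_{0.975} \times se(\bar{y}_i) \quad \text{to} \quad \bar{y}_i + t_{0.975} \times se(\bar{y}_i) \]

其中,t 值的自由度与方差分析表中的残差自由度相对应。
where the value of t has the number of degrees of freedom associated with the residual in the analysis of variance table.

类似地,两个均值差异的置信区间,例如 \(\bar{y}_1\) 与 \(\bar{y}_2\),需要 \(\bar{y}_1 - \bar{y}_2\) 的标准误,计算公式为
Similarly, a confidence interval for the difference between any two means, say \(\bar{y}_1\) and \(\bar{y}_2\), requires the standard error of \(\bar{y}_1 - \bar{y}_2\), which is given by

\[ se(\bar{y}_1 - \bar{y}_2) = s_{res} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} . \]

两个均值差异的 95% 置信区间因此为
The 95% confidence interval for the difference between the two means is thus given by

\[ \bar{y}_1 - \bar{y}_2 - t_{0.975} \times se(\bar{y}_1 - \bar{y}_2) \quad \text{to} \quad \bar{y}_1 - \bar{y}_2 + t_{0.975} \times se(\bar{y}_1 - \bar{y}_2) \]

其中,t 值仍采用残差自由度。
where the value of t again has the residual degrees of freedom.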

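例如,我们可以计算表 9.10 中组 I 与组 II 均值差的置信区间。红细胞叶酸水平均值差为 \(316.6 - 256.4 = 60.2\) μg/l。残差标准差为 \(\sqrt{2090.3} = 45.72\),因此均值差的标准误为 \(45.72 \times \sqrt{1/8 + 1/9} = 22.22\)。自由度为 19 时 \(t_{0.975}\) 的值(见表 B4)为 2.093,故均值差的 95% 置信区间为
For example, we can produce a confidence interval for the difference between groups I and II in Table 9.10. The difference in mean red cell folate levels was \(316.6 - 256.4 = 60.2\) μg/l. The residual standard deviation is \(\sqrt{2090.3} = 45.72\), so the standard error of the difference in means is \(45.72 \times \sqrt{1/8 + 1/9} = 22.22\). The value of \(t_{0.975}\) with 19 degrees of freedom is found (from Table B4) to be 2.093, so the 95% confidence interval for the difference in means is

\[ 60.2 - (2.093 \times 22.22) \quad \text{to} \quad 60.2 + (2.093 \times 22.22) \]

或者 13.7 到 106.7 μg/l。
or 13.7 to 106.7 μg/l.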
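The arithmetic of this interval is easy to reproduce. The sketch below uses only the summary values quoted in the text (the group means and sizes from Table 9.10, the within groups mean square from Table 9.11, and the t value of 2.093 from Table B4); the variable names are illustrative.

```python
import math

# Summary values quoted in the text.
mean_1, n_1 = 316.6, 8      # group I
mean_2, n_2 = 256.4, 9      # group II
ms_within = 2090.3          # within groups (residual) mean square, Table 9.11
t_crit = 2.093              # t for 19 degrees of freedom, two-sided 5% (Table B4)

s_res = math.sqrt(ms_within)                     # residual standard deviation
se_diff = s_res * math.sqrt(1 / n_1 + 1 / n_2)   # standard error of the difference
diff = mean_1 - mean_2

lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff
print(f"s_res = {s_res:.2f}, se = {se_diff:.2f}")
print(f"95% CI for the difference: {lower:.1f} to {upper:.1f}")
```

Note that the standard error uses the residual standard deviation from all three groups, not just the two groups being compared, which is why the degrees of freedom are 19 rather than 15.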

9.8.4 多重比较 9.8.4 Multiple comparisons

对于两个组来说,显著差异的解释相对简单,但当三个或更多组的均值存在显著差异时,我们该如何解读?需要进一步分析以确定均值之间的具体差异,例如某一组是否与其他所有组不同。如果组具有明确的顺序,例如比较不同剂量的药物,则有一种简单的方法,下一节将介绍。当组没有顺序时,调查组间差异没有明确的最佳方法。注意,只有当方差分析中组间总体比较显著时,才应调查个别组之间的差异,除非某些比较是在分析前预先设定的。
With two groups the interpretation of a significant difference is reasonably straightforward, but how do we interpret significant variation among the means of three or more groups? Further analysis is required to find out how the means differ, for example whether one group differs from all the others. If the groups have a clear ordering, for example when different doses of a drug are compared, there is a straightforward approach which will be described in the next section. When the groups are not ordered, however, there is no clearly best approach to investigating variation among the groups. Note that you should only investigate differences between individual groups when the overall comparison of groups in the analysis of variance is significant, unless certain comparisons were intended in advance of the analysis.

一种可能的方法是依次比较每对均值,或者仅比较感兴趣的对。困难在于多重显著性检验会大大增加偶然发现显著差异的概率。每次检验在无真实差异时都有 5% 的假阳性概率(第一类错误),例如有四个组时,进行全部六对比较,至少出现一次假阳性的概率远大于 5%。为解决此问题,提出了若干方法,如 Bonferroni、Newman-Keuls、Duncan 和 Scheffe 等。这些方法旨在将整体第一类错误率控制在不超过 5%(或其他指定水平)。
One possibility is to compare each pair of means in turn, or perhaps just those pairs of interest. The difficulty here is that multiple significance testing gives a high probability of finding a significant difference just by chance. Each test has a 5% chance of a false positive result when there is no real difference (a Type I error), so if we have, say, four groups and perform all six paired tests the probability of at least one false positive result is very much greater than 5%. Several methods have been proposed to deal with this problem, with strange names such as Bonferroni, Newman-Keuls, Duncan and Scheffé. Each method is aimed at controlling the overall Type I error rate at no more than 5% (or some other specified level).
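To give a feel for the size of the problem, the sketch below computes the chance of at least one false positive among the six pairwise tests for four groups. It assumes the six tests are independent, which is not true of pairwise comparisons that share groups, so the figure should be read only as a rough guide.

```python
from math import comb

alpha = 0.05                      # Type I error rate of each individual test
n_groups = 4
n_tests = comb(n_groups, 2)       # number of pairwise comparisons: 6

# Probability that at least one of the tests is falsely significant,
# ASSUMING the tests are independent (an approximation here).
familywise_error = 1 - (1 - alpha) ** n_tests
print(n_tests, round(familywise_error, 3))
```

Under this assumption the overall error rate is roughly a quarter rather than one in twenty.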

这些方法的缺点是它们较为“保守”,倾向于安全(非显著)判断。令人不安的是,有时尽管方差分析中的 F 检验显著,却没有任何一对均值差异显著。
The disadvantage of all of these methods is that they are 'conservative', in that they err on the side of safety (non-significance). It can be disconcerting to find that, although the F test in the analysis of variance is statistically significant, no pair of means is significantly different.

对这些问题没有简单或完全满意的解决方案,但当组无自然顺序时,我推荐以下策略:
There is no simple nor totally satisfactory solution to these problems, but I recommend the following strategy when the groups do not have any natural order:

1. 在分析前决定特别感兴趣比较的组(越少越好);
   Decide in advance of the analysis which groups you are particularly interested in comparing (the fewer the better);
2. 使用 Bonferroni(或其他)方法调整 P 值,对感兴趣的组对进行修正的 t 检验。
   Perform modified t tests to compare the pairs of groups of interest, using the Bonferroni (or some other) method to adjust the P values.

修正的 t 检验基于所有组的合并方差估计(即方差分析表中的残差方差),而非仅考虑的那一对组。因此,t 的计算公式为
The modified t test is based on the pooled estimate of variance from all the groups (which is the residual variance in the anova table), not just the pair being considered. So t is calculated as

\[ t = \frac{\bar{y}_1 - \bar{y}_2}{se(\bar{y}_1 - \bar{y}_2)} \]

其中 \(se(\bar{y}_1 - \bar{y}_2)\) 如前节所述。
where \(se(\bar{y}_1 - \bar{y}_2)\) is as given in the previous section.

如果我们进行 \(k\) 次配对比较,则应将每次检验得到的 P 值乘以 \(k\);即计算 \(P' = kP\),且 \(P'\) 不得超过1。这种简单调整方法称为 Bonferroni 方法。对于较少次数的比较(例如最多五次),该方法是合理的,但对于大量比较则过于保守。然而,我不建议进行大量比较,这通常意味着研究目标定义不明确。统计软件可能提供不同的多重比较方法,如 Duncan 多重范围检验。这些方法的原理类似,但比 Bonferroni 方法保守性低。
If we perform \(k\) paired comparisons, then we should multiply the P value obtained from each test by \(k\); that is, we calculate \(P' = kP\) with the restriction that \(P'\) cannot exceed 1. This simple adjustment is known as the Bonferroni method. For small numbers of comparisons (say up to five) its use is reasonable, but for large numbers it is highly conservative. However, I do not recommend that large numbers of comparisons are performed, which would suggest poorly specified research objectives. Statistical packages may offer different multiple comparison procedures, such as Duncan's multiple range test. These all work in a similar way, but are less conservative than the Bonferroni method.

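回到表9.10和9.11中的红细胞叶酸数据,残差标准差为 \(s_{res} = 45.72\)。比较组I和组II的修正 t 检验通过计算得出:
Returning to the red cell folate data in Tables 9.10 and 9.11, the residual standard deviation was \(s_{res} = 45.72\). A modified t test to compare groups I and II is performed by calculating

\[ t = \frac{316.6 - 256.4}{45.72 \sqrt{1/8 + 1/9}} = \frac{60.2}{22.22} = 2.71 . \]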


如果我们比较每一对组别,将进行三次比较。上述 t 值2.71对应 \(P < 0.02\)(表B4),精确值为 \(P = 0.014\)。校正后的 P 值为 \(3 \times 0.014 = 0.042\),调整后刚好在5%显著性水平上显著。其他两个比较均不显著。因此,方差分析(表9.11)中识别的组间差异主要是组I与组II之间的差异。
If we are comparing each pair of groups we will make three comparisons. The above value of t of 2.71 corresponds to \(P < 0.02\) (Table B4), with an exact value of \(P = 0.014\). The corrected P value is \(3 \times 0.014 = 0.042\), so it is just significant at the 5% level after adjustment. Neither of the other comparisons is significant. The main explanation for the difference between the groups that was identified in the analysis of variance (Table 9.11) is thus the difference between groups I and II.
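The three modified t tests and their Bonferroni adjustments can be sketched as follows. This is an illustration rather than a reference implementation: the helper `t_two_sided_p` is a hypothetical name, and it obtains the two-sided P value by numerically integrating the t density with Simpson's rule because the Python standard library has no t distribution function. A statistical package would normally be used instead.

```python
import math
from itertools import combinations

# (mean, n) for each group from Table 9.10; residual mean square from Table 9.11.
groups = {"I": (316.6, 8), "II": (256.4, 9), "III": (278.0, 5)}
ms_within, df_within = 2090.3, 19
s_res = math.sqrt(ms_within)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for Student's t, by Simpson's rule on the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3                # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central)

pairs = list(combinations(groups, 2))
k = len(pairs)                             # number of comparisons (3)
results = {}
for a, b in pairs:
    (mean_a, n_a), (mean_b, n_b) = groups[a], groups[b]
    se = s_res * math.sqrt(1 / n_a + 1 / n_b)   # pooled over all groups
    t = (mean_a - mean_b) / se
    p = t_two_sided_p(t, df_within)
    p_adj = min(1.0, k * p)                # Bonferroni: P' = kP, capped at 1
    results[(a, b)] = (t, p, p_adj)
    print(f"{a} v {b}: t = {t:.2f}, P = {p:.3f}, adjusted P = {p_adj:.3f}")
```

Only the comparison of groups I and II remains below 0.05 after adjustment, in line with the conclusion in the text.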

9.8.5 有序组 9.8.5 Ordered groups

当组别有序时,不宜比较每一对组别,而应研究组间是否存在趋势。对于许多目的,考虑是否存在线性趋势即可。
When the groups are ordered it is not reasonable to compare each pair of groups, but rather we should study the possibility that there is a trend across groups. For many purposes it will suffice to consider whether there is a linear trend.

表9.12显示了健康志愿者按六个年龄组划分的血清胰蛋白酶水平的均值和标准差。我们可以利用第9.9节给出的公式,从这些汇总统计量进行单因素方差分析,而无需原始数据,结果见表9.13。(遗憾的是,很少有统计软件能直接用已计算的均值和标准差进行方差分析。)显然,六个年龄组间存在高度显著的差异。然而,我们还可以进一步“分解”组间变异成不同成分。这里我们更关心是否存在线性趋势,即血清胰蛋白酶值是否随年龄增加而以恒定速率上升。
Table 9.12 shows the mean and standard deviation of serum trypsin levels in healthy volunteers divided into six age groups. We can carry out one way analysis of variance from these summary statistics without having the raw observations, using the formulae given in section 9.9, to get the results shown in Table 9.13. (Unfortunately, very few statistical packages will perform analysis of variance using means and standard deviations that are already calculated.) Clearly there is highly significant variation among the six age groups. However, we can go further by 'partitioning' the variability between groups into components. Here we would be more interested in whether there was a linear trend, that is whether serum trypsin values tend to rise at a constant rate with increasing age.

表9.12 健康志愿者按六个年龄组划分的免疫反应性胰蛋白酶血清水平(数据来源:Koehn 和 Mostbeck,1981)
Table 9.12 Serum levels of immunoreactive trypsin in healthy volunteers divided into six age groups (based on data given by Koehn and Mostbeck, 1981)

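| 年龄 | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 |
|---|---|---|---|---|---|---|
| 受试者人数 | 32 | 137 | 38 | 44 | 16 | 4 |
| 均值 (ng/ml) | 128 | 152 | 194 | 207 | 215 | 218 |
| 标准差 (ng/ml) | 50.9 | 58.5 | 49.3 | 66.3 | 60.0 | 14.0 |

| Age | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 |
|---|---|---|---|---|---|---|
| Number of subjects | 32 | 137 | 38 | 44 | 16 | 4 |
| Mean (ng/ml) | 128 | 152 | 194 | 207 | 215 | 218 |
| Standard deviation (ng/ml) | 50.9 | 58.5 | 49.3 | 66.3 | 60.0 | 14.0 |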

表9.13 表9.12数据的一因素方差分析
Table 9.13 One way analysis of variance of data in Table 9.12

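| 变异来源 | 自由度 | 平方和 | 均方 | F值 | P值 |
|---|---|---|---|---|---|
| 组间 | 5 | 224103 | 44820.6 | 13.5 | < 0.0001 |
| 组内 | 265 | 879272 | 3318.0 | | |
| 总计 | 270 | 1103375 | | | |

| Source of variation | df | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Between groups | 5 | 224103 | 44820.6 | 13.5 | < 0.0001 |
| Within groups | 265 | 879272 | 3318.0 | | |
| Total | 270 | 1103375 | | | |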
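Since few packages accept summary statistics directly, it may help to see how little code is needed. The sketch below rebuilds Table 9.13 from the group sizes, means and standard deviations in Table 9.12, using the formulae of section 9.9 (within groups sum of squares from the standard deviations, between groups sum of squares from the means). Variable names are illustrative.

```python
# Summary statistics from Table 9.12 (serum trypsin, six age groups).
n = [32, 137, 38, 44, 16, 4]
mean = [128, 152, 194, 207, 215, 218]
sd = [50.9, 58.5, 49.3, 66.3, 60.0, 14.0]

N = sum(n)                      # total number of observations
k = len(n)                      # number of groups
total = sum(ni * mi for ni, mi in zip(n, mean))     # sum of all observations

# Within groups: each group contributes (n_i - 1) * s_i^2.
ss_within = sum((ni - 1) * si ** 2 for ni, si in zip(n, sd))
# Between groups: sum of n_i * mean_i^2 minus the correction term T^2 / N.
ss_between = sum(ni * mi ** 2 for ni, mi in zip(n, mean)) - total ** 2 / N

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
f_ratio = ms_between / ms_within

print(f"Between groups: df={k - 1}, SS={ss_between:.0f}, MS={ms_between:.1f}")
print(f"Within groups:  df={N - k}, SS={ss_within:.0f}, MS={ms_within:.1f}")
print(f"F = {f_ratio:.1f}")
```

Because the means and standard deviations are themselves rounded, results obtained this way can differ slightly from an analysis of the raw data.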

利用第9.9节给出的公式,我们发现与线性趋势相关的平方和为55147,自由度为1,因此方差分析表可重写为表9.14所示。存在高度显著的线性趋势,表明血清胰蛋白酶平均水平随年龄增加而升高。然而,年龄组间的非线性变异也高度显著,说明线性趋势仅解释了部分年龄效应。在一因素方差分析中拟合线性趋势等同于线性回归分析,后者将在第11章介绍。
Using the formula given in section 9.9 we find that the sum of squares associated with a linear trend is 55147 on one degree of freedom, so the analysis of variance table can be rewritten as shown in Table 9.14. There is a highly significant linear trend, showing that mean serum trypsin level rises with age. However, the non-linear variation between the age groups is also highly significant, indicating that the linear trend only explains some of the age effect. Fitting a linear trend in one way analysis of variance is equivalent to linear regression analysis, which is described in Chapter 11.

表9.14 显示线性趋势检验的方差分析表
Table 9.14 Analysis of variance table showing test for linear trend

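| 变异来源 | 自由度 | 平方和 | 均方 | F值 | P值 |
|---|---|---|---|---|---|
| 组间: | 5 | 224103 | 44820.6 | | |
| (a) 线性 | 1 | 55147 | 55147.0 | 16.6 | < 0.0001 |
| (b) 非线性 | 4 | 168956 | 42239.0 | 12.7 | < 0.0001 |
| 组内: | 265 | 879272 | 3318.0 | | |
| 总计 | 270 | 1103375 | | | |

| Source of variation | df | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Between groups: | 5 | 224103 | 44820.6 | | |
| (a) linear | 1 | 55147 | 55147.0 | 16.6 | < 0.0001 |
| (b) non-linear | 4 | 168956 | 42239.0 | 12.7 | < 0.0001 |
| Within groups: | 265 | 879272 | 3318.0 | | |
| Total | 270 | 1103375 | | | |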

9.8.6 非参数一因素方差分析—Kruskal-Wallis检验 9.8.6 Non-parametric one way analysis of variance - the Kruskal-Wallis test

正如方差分析是 t 检验的更一般形式,非参数的Mann-Whitney检验也有更一般的形式。Kruskal-Wallis检验是Mann-Whitney检验的明显数学推广,且存在与单因素方差分析相同的解释问题。
Just as analysis of variance is a more general form of the t test, so there is a more general form of the non-parametric Mann-Whitney test. The Kruskal-Wallis test is an obvious mathematical extension of the Mann-Whitney test, with the same problems of interpretation as were just discussed for one way analysis of variance.

检验统计量的计算很简单。将全部 \(N\) 个观测值不分组别地排名,排名从1到 \(N\),然后计算每组的排名总和。若第 \(i\) 组中 \(n_i\) 个观测值的排名和为 \(R_i\),则该组的平均排名为 \(\bar{R}_i = R_i / n_i\)。统计量 \(H\) 定义为
The calculation of the test statistic is simple. The complete set of \(N\) observations is ranked from 1 to \(N\) regardless of which group they are in, and for each group the sum of the ranks is calculated. If the sum of the ranks of the \(n_i\) observations in the \(i\)th group is \(R_i\), then the average rank in each group is given by \(\bar{R}_i = R_i / n_i\). We calculate the statistic \(H\) defined by

\[ H = \frac{12 \sum n_i (\bar{R}_i - \bar{R})^2}{N(N + 1)} \]

其中,\(\bar{R}\) 是所有秩的平均值,且总是等于 \((N + 1)/2\)。该公式中的求和项与参数单因素方差分析中计算的组间平方和非常相似(数学表达见第9.9节)。
where \(\bar{R}\) is the average of all the ranks, and is always equal to \((N + 1)/2\). The summation term in this formula is very similar to the between group sum of squares calculated in parametric one way analysis of variance (shown mathematically in section 9.9).

虽然这个 \(H\) 的公式展示了检验的原理,但计算时有一个等效且更简单的版本,\(H\) 表达为
While this formula for \(H\) shows the way the test works, there is an equivalent but simpler version for calculation, with \(H\) given by

\[ H = \frac{12}{N(N + 1)} \sum \frac{R_i^2}{n_i} - 3(N + 1) . \]

Kruskal-Wallis检验统计量的分布不同于本章介绍的其他方法。当原假设成立时,检验统计量服从卡方分布,卡方(Chi)对应希腊字母 \(\chi\),发音类似于“sky”中的“ky”。卡方分布主要用于分类数据分析,因此将在下一章(第10.6.3节)中详细讨论。目前只需注意,组间的任何变异都会使检验统计量 \(H\) 增大,因此我们只关注卡方分布的右尾。对于三个或更多组,单侧和双侧检验的概念不适用。
The Kruskal-Wallis test statistic has a different distribution from the other methods described in this chapter. When the null hypothesis is true the test statistic follows the Chi squared distribution, where Chi is the Greek letter \(\chi\), which is pronounced like the 'ky' in 'sky'. The Chi squared distribution is mainly used for the analysis of categorical data, and so will be considered in more detail in the next chapter (section 10.6.3). For the moment it should be sufficient to note that any variation among the groups will increase the test statistic \(H\). We are therefore concerned with only the upper tail of the Chi squared distribution. The idea of one- and two-sided tests does not apply with three or more groups.

如果有 \(k\) 组观测值,统计量 \(H\) 会与自由度为 \(k - 1\) 的卡方分布进行比较。统计学上显著的结果意味着我们拒绝各组来自具有相同中位数总体的假设,得出各组之间存在差异的结论。
If there are \(k\) groups of observations, the statistic \(H\) is compared with a Chi squared distribution with \(k - 1\) degrees of freedom. A statistically significant result means that we reject the hypothesis that the groups come from populations with the same median, and conclude that there are differences among the groups.

可以使用两样本 Mann-Whitney 检验尝试确定差异所在,同时适当考虑多重检验的影响。如果各组有序,可以像上文单因素方差分析(见第9.8.7节)所述,进行趋势检验。
Two sample Mann-Whitney tests can be used to try to identify where the differences are, making due allowance for multiple testing. If the groups are ordered it is possible to test for a trend, in a similar way to that described above for one way analysis of variance (see section 9.8.7).

Fentress 等人(1986年)报道了一项随机对照试验,比较了三组各六名患有频繁且严重偏头痛儿童的治疗效果。所用的活跃治疗包括放松反应训练,有无生物反馈,第三组儿童未接受治疗。研究期间前后记录了头痛的频率和持续时间,二者差值作为每周头痛活动的衡量指标。
Fentress et al. (1986) reported the results of a randomized comparison of three groups of six children suffering from frequent and severe migraine. The active treatments given were relaxation response, either with or without biofeedback, and a third group of children was not treated. The frequency and duration of headaches were recorded before and after the study period, and the difference between these measurements was used as a measure of weekly headache activity.

表9.15显示了每位儿童头痛活动的减少百分比。注意,负值表示头痛活动增加。三名儿童在研究结束时完全无头痛,故头痛活动减少了 100%。这些数据明显不适合方差分析,但我们可以应用 Kruskal-Wallis 检验。表9.15还列出了数据的秩次及各组的平均秩次。利用上述公式,我们可以计算统计量 \(H\)
Table 9.15 shows the reduction in headache activity for each child, expressed as a percentage. Note that a negative value indicates an increase in headache activity. Three children had a complete absence of headaches at the end of the study period and thus a reduction of 100%. These observations are clearly unsuited for analysis of variance, but we can apply the Kruskal-Wallis test. Table 9.15 also shows the ranks of the data and the mean rank for each group. Using the equation given above we can calculate the statistic \(H\) as

\[ H = \frac{12 \times 6 \times \left[ (9.17 - 9.5)^2 + (6.00 - 9.5)^2 + (13.33 - 9.5)^2 \right]}{18 \times 19} = 5.69 . \]

表9.15 三个治疗组每周头痛活动减少百分比,基于基线数据(Fentress等,1986)。括号内为秩次。
Table 9.15 Reduction in weekly headache activity for three treatment groups, expressed as a percentage of baseline data (Fentress et al., 1986). Ranks are shown in brackets.

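| | 放松反应与生物反馈 | 仅放松反应 | 未治疗 |
|---|---|---|---|
| | 62 (11) | 69 (10) | 50 (12) |
| | 74 (8.5) | 43 (13) | -120 (17) |
| | 86 (7) | 100 (2) | 100 (2) |
| | 74 (8.5) | 94 (5) | -288 (18) |
| | 91 (6) | 100 (2) | 4 (15) |
| | 37 (14) | 98 (4) | -76 (16) |
| 秩和 | 55 | 36 | 80 |
| 平均秩 | 9.17 | 6.00 | 13.33 |

| | Relaxation response and biofeedback | Relaxation response alone | Untreated |
|---|---|---|---|
| | 62 (11) | 69 (10) | 50 (12) |
| | 74 (8.5) | 43 (13) | -120 (17) |
| | 86 (7) | 100 (2) | 100 (2) |
| | 74 (8.5) | 94 (5) | -288 (18) |
| | 91 (6) | 100 (2) | 4 (15) |
| | 37 (14) | 98 (4) | -76 (16) |
| Rank sum | 55 | 36 | 80 |
| Mean rank | 9.17 | 6.00 | 13.33 |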

从表B5中我们发现,自由度为2时,\(H\) 值为5.69,对应的 P 值介于0.1和0.05之间,更接近0.05。(实际上是0.058。)
From Table B5 we find that a value of \(H\) of 5.69 on 2 degrees of freedom gives P between 0.1 and 0.05, and is much nearer to 0.05. (It is actually 0.058.)
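The whole calculation, including the ranking, can be sketched in a few lines. The example below ranks the 18 observations from largest to smallest (as in Table 9.15), giving tied values the mean of the ranks they share, and then applies the simpler computing formula for \(H\). No correction for ties is applied, matching the calculation in the text. The P value uses the fact that the upper tail of the Chi squared distribution with exactly 2 degrees of freedom is \(e^{-H/2}\); this shortcut is an assumption that holds only for three groups. Working from unrounded ranks gives \(H = 5.70\) rather than 5.69, a difference due only to rounding of the mean ranks.

```python
import math

# Percentage reduction in weekly headache activity (Table 9.15).
groups = {
    "relaxation and biofeedback": [62, 74, 86, 74, 91, 37],
    "relaxation alone": [69, 43, 100, 94, 100, 98],
    "untreated": [50, -120, 100, -288, 4, -76],
}

def midranks(values):
    """Map each value to its rank (largest = 1), averaging ranks of ties."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        ranks[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1, ..., j
        i = j
    return ranks

pooled = [x for obs in groups.values() for x in obs]
N = len(pooled)
rank_of = midranks(pooled)

rank_sums = {name: sum(rank_of[x] for x in obs) for name, obs in groups.items()}
H = (12 / (N * (N + 1))
     * sum(r ** 2 / len(groups[name]) for name, r in rank_sums.items())
     - 3 * (N + 1))

df = len(groups) - 1
p_value = math.exp(-H / 2)   # upper tail of Chi squared; valid only for df = 2

print(rank_sums)
print(f"H = {H:.2f} on {df} degrees of freedom, P = {p_value:.3f}")
```

Ranking from smallest to largest instead would change the rank sums but not \(H\), because the statistic depends only on squared deviations of the mean ranks from \((N + 1)/2\).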

由于各组样本量较小,使用Mann-Whitney检验比较每对组的差异统计效能不强,事实上即使不考虑多重比较,三个 P 值均大于0.05。然而,合理的做法是考虑两个主动治疗组合并后是否优于未治疗对照组,Mann-Whitney检验得出 \(P < 0.05\),支持两种治疗均有效的观点,但样本量不足以区分两种治疗的差异。
Because the groups are small, comparison of each pair of groups with Mann-Whitney tests is not very powerful, and in fact all three P values are greater than 0.05 even without allowing for multiple comparisons. However, it is reasonable to consider whether the two actively treated groups together did better than the untreated controls, and a Mann-Whitney test gives \(P < 0.05\), supporting the suggestion that both treatments are beneficial but that the study is too small to be able to distinguish them.

如果将Kruskal-Wallis检验应用于仅两个观察组,结果与Mann-Whitney检验完全相同。前者的检验统计量是后者统计量的平方。
If we apply the Kruskal-Wallis test to just two groups of observations we obtain exactly the same result as that from the Mann-Whitney test. The test statistic from the former is the square of the statistic from the latter.

9.8.7 有序组的非参数检验 9.8.7 Non-parametric test for ordered groups

(本节可略读,不影响内容连贯性。)
(This section can be omitted without loss of continuity.)

有多种非参数方法用于检测有序组间的趋势。以下介绍的方法由Cuzick(1985)提出。如果只关心趋势,则无需进行Kruskal-Wallis检验。
There are several non-parametric methods to test for trend across ordered groups. The method described below is due to Cuzick (1985). It is not necessary to perform the Kruskal-Wallis test if the trend is the only aspect of interest.

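设有 \(k\) 组样本,样本量分别为 \(n_i\)(\(i = 1, \ldots, k\)),总样本量为 \(N = \sum n_i\)。各组赋予得分 \(l_i\),反映其顺序,如1、2、3。得分不必等距,但通常是等距的。将全部 \(N\) 个观察值排名,从1到 \(N\),计算各组秩和 \(R_i\)。计算所有组得分的加权和 \(L\),公式为
We have \(k\) groups of sample sizes \(n_i\) (\(i = 1, \ldots, k\)), where \(N = \sum n_i\). The groups are given scores, \(l_i\), which reflect their ordering, such as 1, 2 and 3. The scores do not have to be equally spaced, but they usually are. The total set of \(N\) observations is ranked from 1 to \(N\), and the sums of the ranks in each group, \(R_i\), are obtained. We calculate a weighted sum of all the group scores, \(L\), as

\[ L = \sum l_i n_i . \]

统计量 \(T\) 计算公式为
The statistic \(T\) is calculated as

\[ T = \sum l_i R_i . \]

在原假设下,\(T\) 的期望值为 \(E(T) = \tfrac{1}{2}(N + 1)L\),其标准误为
Under the null hypothesis the expected value of \(T\) is \(E(T) = \tfrac{1}{2}(N + 1)L\), and its standard error is

\[ se(T) = \sqrt{\frac{N + 1}{12} \left( N \sum l_i^2 n_i - L^2 \right)} \]

因此检验统计量 \(z\) 由下式给出
so that the test statistic, \(z\), is given by

\[ z = \frac{T - E(T)}{se(T)} \]

当无趋势的原假设成立时,\(z\) 近似服从标准正态分布。
which has an approximately standard Normal distribution when the null hypothesis of no trend is true.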

表9.16显示了32副太阳镜的眼部紫外线暴露情况,这些太阳镜根据透过的可见光量分为三组。我们可以用刚才描述的方法检验这三组中暴露量是否存在递增趋势。
Table 9.16 shows ocular exposure to ultraviolet radiation for 32 pairs of sunglasses classified into three groups according to the amount of visible light transmitted. We can use the method just described to test for a trend for increasing exposure across the three groups.

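三组的评分分别为 \(-1\)、0 和 1(相较于 1、2 和 3,这样简化了计算)。部分计算过程,包括 \(T\) 的计算,见表9.16。我们有
The groups are given scores \(-1\), 0, and 1 (which simplifies the arithmetic in comparison with scores of 1, 2, and 3). Some of the calculations, including that of \(T\), are shown in Table 9.16. We have

\[ E(T) = \tfrac{1}{2} \times 33 \times 2 = 33 \]

和
and

\[ se(T) = \sqrt{\frac{33}{12} \left( 32 \times 14 - 2^2 \right)} = 34.94 \]

因此检验统计量为
so that the test statistic is given by

\[ z = \frac{86 - 33}{34.94} = 1.52 . \]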
因此,几乎没有证据支持眼睛暴露于紫外线与可见光透过量有关的假设。
There is thus little evidence to support the suggestion that ocular exposure to ultraviolet radiation is related to the amount of visible light transmitted.
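The steps of the trend test can be sketched from the raw data in Table 9.16. The code below is an illustration with hypothetical variable names; it ranks the pooled observations from smallest to largest with tied values sharing the mean rank, then forms \(L\), \(T\), \(E(T)\), \(se(T)\) and \(z\) as defined above. The two-sided P value from the standard Normal distribution is an addition for convenience and is not quoted in the text.

```python
import math

# Ocular exposure to ultraviolet radiation (% of exposure without sunglasses),
# Table 9.16. Each entry is (score l_i, observations).
groups = [
    (-1, [1.4, 1.4, 1.4, 1.6, 2.3, 2.5]),                        # < 25%
    (0, [0.9, 1.0, 1.1, 1.1, 1.2, 1.2, 1.5, 1.9, 2.2,
         2.6, 2.6, 2.6, 2.8, 2.8, 3.2, 3.5, 4.3, 5.1]),           # 25 to 35%
    (1, [0.8, 1.7, 1.7, 1.7, 3.4, 7.1, 8.9, 13.5]),               # > 35%
]

pooled = sorted(x for _, obs in groups for x in obs)
N = len(pooled)

def midrank(x):
    """Rank of x in the pooled sample; ties share the mean of their ranks."""
    first = pooled.index(x) + 1
    count = pooled.count(x)
    return first + (count - 1) / 2

rank_sums = [sum(midrank(x) for x in obs) for _, obs in groups]        # R_i
L = sum(score * len(obs) for score, obs in groups)                     # sum l_i n_i
T = sum(score * r for (score, _), r in zip(groups, rank_sums))         # sum l_i R_i
sum_l2n = sum(score ** 2 * len(obs) for score, obs in groups)          # sum l_i^2 n_i

expected_T = (N + 1) * L / 2
se_T = math.sqrt((N + 1) / 12 * (N * sum_l2n - L ** 2))
z = (T - expected_T) / se_T
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided, standard Normal

print(rank_sums, L, T)
print(f"E(T) = {expected_T:.0f}, se(T) = {se_T:.2f}, z = {z:.2f}, P = {p_value:.2f}")
```

The rank sums, \(L\) and \(T\) agree with the summary rows of Table 9.16.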

表9.16 太阳镜对眼睛紫外线暴露的影响与可见光透过量的关系(Rosenthal 等,1988)。眼睛暴露量以无太阳镜时的暴露百分比表示。括号内为观测值的秩次。
Table 9.16 The effect of sunglasses on ocular exposure to ultraviolet radiation in relation to amount of visible light transmitted (Rosenthal et al., 1988). Ocular exposure is expressed as the percentage of exposure without sunglasses. The ranks of the observations are shown in brackets

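| 可见光透过率 | < 25% | 25% 到 35% | > 35% | 总计 |
|---|---|---|---|---|
| | 1.4 (9) | 0.9 (2) | 0.8 (1) | |
| | 1.4 (9) | 1.0 (3) | 1.7 (14) | |
| | 1.4 (9) | 1.1 (4.5) | 1.7 (14) | |
| | 1.6 (12) | 1.1 (4.5) | 1.7 (14) | |
| | 2.3 (18) | 1.2 (6.5) | 3.4 (26) | |
| | 2.5 (19) | 1.2 (6.5) | 7.1 (30) | |
| | | 1.5 (11) | 8.9 (31) | |
| | | 1.9 (16) | 13.5 (32) | |
| | | 2.2 (17) | | |
| | | 2.6 (21) | | |
| | | 2.6 (21) | | |
| | | 2.6 (21) | | |
| | | 2.8 (23.5) | | |
| | | 2.8 (23.5) | | |
| | | 3.2 (25) | | |
| | | 3.5 (27) | | |
| | | 4.3 (28) | | |
| | | 5.1 (29) | | |
| \(n_i\) | 6 | 18 | 8 | 32 (= N) |
| \(R_i\) | 76 | 290 | 162 | |
| \(l_i\) | -1 | 0 | 1 | |
| \(l_i n_i\) | -6 | 0 | 8 | 2 (= L) |
| \(R_i l_i\) | -76 | 0 | 162 | 86 (= T) |
| \(l_i^2 n_i\) | 6 | 0 | 8 | 14 |

| Transmission of visible light | < 25% | 25 to 35% | > 35% | Total |
|---|---|---|---|---|
| | 1.4 (9) | 0.9 (2) | 0.8 (1) | |
| | 1.4 (9) | 1.0 (3) | 1.7 (14) | |
| | 1.4 (9) | 1.1 (4.5) | 1.7 (14) | |
| | 1.6 (12) | 1.1 (4.5) | 1.7 (14) | |
| | 2.3 (18) | 1.2 (6.5) | 3.4 (26) | |
| | 2.5 (19) | 1.2 (6.5) | 7.1 (30) | |
| | | 1.5 (11) | 8.9 (31) | |
| | | 1.9 (16) | 13.5 (32) | |
| | | 2.2 (17) | | |
| | | 2.6 (21) | | |
| | | 2.6 (21) | | |
| | | 2.6 (21) | | |
| | | 2.8 (23.5) | | |
| | | 2.8 (23.5) | | |
| | | 3.2 (25) | | |
| | | 3.5 (27) | | |
| | | 4.3 (28) | | |
| | | 5.1 (29) | | |
| \(n_i\) | 6 | 18 | 8 | 32 (= N) |
| \(R_i\) | 76 | 290 | 162 | |
| \(l_i\) | -1 | 0 | 1 | |
| \(l_i n_i\) | -6 | 0 | 8 | 2 (= L) |
| \(R_i l_i\) | -76 | 0 | 162 | 86 (= T) |
| \(l_i^2 n_i\) | 6 | 0 | 8 | 14 |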

9.8.8 重复观测 9.8.8 Replicated observations

本节描述的方法仅适用于每个个体仅测量一次的情况。若每人有两次或更多重复测量,则必须采用更复杂的方法,其中部分方法在第12.3节中有所介绍。对于第12章未涵盖的设计,建议咨询统计学家。在某些情况下,分析重复观测的平均值可能合理,但这可能会丢失有价值的信息。绝不可将每个个体的多次观测当作独立样本处理。
The methods described in this section apply only when a single measurement is made for each individual. If two or more replicated measurements are taken for each person, then more complex methods must be used, some of which are described in section 12.3. For designs not covered in Chapter 12 it is advisable to consult a statistician. In some cases it may be reasonable to analyse the average of replicate observations but this may throw away valuable information. It is never valid to treat multiple observations from each individual as if they were independent.

9.9 单因素方差分析—数学原理与实例演示 9.9 ONE WAY ANALYSIS OF VARIANCE - MATHEMATICS AND WORKED EXAMPLE

(本节给出第9.8节中描述计算方法的数学公式,内容可省略而不影响连贯性。)
(This section gives the mathematical formulae for the calculations described in section 9.8. It can be omitted without loss of continuity.)

9.9.1 单因素方差分析 9.9.1 One way analysis of variance

大多数统计软件包都包含基于原始数据的单因素方差分析,因此通常无需使用下面(a)项中的公式。然而,当仅有均值和标准差时,如第9.9.3节的实例所示,可能需要使用(b)项中的方法。
Most statistical packages include one way analysis of variance using the raw data, so it should not be necessary to use the formulae in (a) below. However, the method given in (b) will probably be needed when only means and standard deviations are available, as in the worked example in section 9.9.3.

(a)原始数据可用 (a) Raw data available

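单因素方差分析的计算基于每个样本观测值的总和。假设有 \(k\) 个样本,第 \(i\) 个样本中有 \(n_i\) 个观测值,则计算如下:
The calculations for one way analysis of variance are expressed in relation to the sum of the observations in each sample. Suppose we have \(k\) samples of observations, with \(n_i\) observations in the \(i\)th sample, then we calculate

\(\bar{y}_i\) = 第 \(i\) 组观测值的均值,
\(\bar{y}_i\) = mean of observations in \(i\)th group,

\(S_i\) = 第 \(i\) 组观测值的平方和,
\(S_i\) = sum of squares of observations in \(i\)th group,

\(T = \sum n_i \bar{y}_i\) = 所有观测值的总和,
\(T = \sum n_i \bar{y}_i\) = sum of all observations,

\(S = \sum S_i\) = 所有观测值的平方和,
\(S = \sum S_i\) = sum of squares of all observations,

\(N = \sum n_i\) = 总观测值数量。
\(N = \sum n_i\) = total number of observations.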

单因素方差分析的平方和如下:
The sums of squares for the one way analysis of variance are as follows:

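| 变异来源 | 平方和 |
|---|---|
| 组间 | \(\sum n_i \bar{y}_i^2 - T^2/N\) |
| 组内 | \(S - \sum n_i \bar{y}_i^2\) |
| 总计 | \(S - T^2/N\) |

| Source of variation | Sum of squares |
|---|---|
| Between groups | \(\sum n_i \bar{y}_i^2 - T^2/N\) |
| Within groups | \(S - \sum n_i \bar{y}_i^2\) |
| Total | \(S - T^2/N\) |

组间自由度为 \(k - 1\),组内自由度为 \(N - k\)。均方为平方和除以相应的自由度。组内均方的平方根即残差标准差,记为 \(s_{res}\)。
There are \(k - 1\) degrees of freedom between groups and \(N - k\) within groups. The mean squares are the sums of squares divided by the degrees of freedom. The square root of the within groups mean square is the residual standard deviation, \(s_{res}\).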

(b) 已知均值和标准差 (b) Means and standard deviations available

如果我们已知每组大小为 \(n_i\) 的均值 \(\bar{y}_i\) 和标准差 \(s_i\),可以用上述 \(T\) 和 \(N\) 的公式,结合一个更简单的计算组内平方和的方法:
If we already have the mean \(\bar{y}_i\) and standard deviation \(s_i\) for each group of size \(n_i\) we can use the above formulae for \(T\) and \(N\) together with a simpler method of calculating the within groups sum of squares, as

\[ \sum (n_i - 1) s_i^2 . \]

9.9.2 线性趋势 9.9.2 Linear trend

如果各组有自然的顺序,组间平方和可以分解为线性趋势部分和剩余的(非线性)部分。我们给各组赋予分数 \(l_i\),这些 \(l_i\) 的值等距分布,且其和为零。然后计算
If there is a natural ordering of the groups, the between groups sum of squares can be partitioned into a component due to a linear trend, and the remaining (non-linear) component. We give scores \(l_i\) to the groups, where the values of the \(l_i\) are equally spaced and chosen so that their sum is zero. We then calculate

\[ L = \sum l_i \bar{y}_i \]

及其标准误
and its standard error

\[ se(L) = s_{res} \sqrt{\sum \frac{l_i^2}{n_i}} . \]

可以通过比较 \(L / se(L)\) 与自由度为组内自由度(\(N - k\))的 t 分布,进行单样本 t 检验。
A one sample t test can be performed by comparing \(L / se(L)\) to the t distribution with the number of degrees of freedom within groups (\(N - k\)).

另一种方法是计算与 \(L\) 相关的平方和
Alternatively, the sum of squares due to \(L\) can be calculated as

\[ \frac{L^2}{\sum l_i^2 / n_i} \]

并通过将组间平方和分解为线性和非线性部分,重新计算方差分析表。线性对比的 F 检验与上述 t 检验完全等价。
and the analysis of variance table recalculated by partitioning the between groups sum of squares into linear and non-linear components. The F test for the linear contrast is exactly equivalent to the above t test.

(此方法等同于以 \(l_i\) 作为自变量进行回归分析,参见第11.10节和11.15.1节。)
(This method is equivalent to performing a regression analysis with the \(l_i\) as explanatory variable; see sections 11.10 and 11.15.1.)

9.9.3 具体示例 9.9.3 Worked example

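对于表9.12中的血清胰蛋白酶数据,271个观测值的总和为
For the serum trypsin data in Table 9.12 the sum of the 271 observations is given by

\[ T = (32 \times 128) + (137 \times 152) + (38 \times 194) + (44 \times 207) + (16 \times 215) + (4 \times 218) = 45712 . \]

组内平方和通过基于标准差的公式计算得出为
The within groups sum of squares is obtained from the formula based on standard deviations as

\[ (31 \times 50.9^2) + (136 \times 58.5^2) + (37 \times 49.3^2) + (43 \times 66.3^2) + (15 \times 60.0^2) + (3 \times 14.0^2) = 879272 \]

而量 \(\sum n_i \bar{y}_i^2\) 为
and the quantity \(\sum n_i \bar{y}_i^2\) is

\[ (32 \times 128^2) + (137 \times 152^2) + (38 \times 194^2) + (44 \times 207^2) + (16 \times 215^2) + (4 \times 218^2) = 7934756 . \]

因此,组间平方和为
The between groups sum of squares is thus

\[ 7934756 - \frac{45712^2}{271} = 224103 . \]

方差分析的完整表格见表9.13。残差标准差为 \(\sqrt{3318.0} = 57.60\)。为了评估可能的线性趋势,我们给组赋予分数 \(l_i\),这些分数等间距且和为零,例如 \(-5, -3, -1, 1, 3\) 和 5。线性对比的值为
The complete analysis of variance table is shown in Table 9.13. The residual standard deviation is \(\sqrt{3318.0} = 57.60\). To evaluate a possible linear trend we give the groups scores \(l_i\) which are equally spaced and add to zero, such as \(-5, -3, -1, 1, 3\) and 5. The value of the linear contrast is then

\[ L = \sum l_i \bar{y}_i = 652.0 \]

它的标准误为
and its standard error is

\[ se(L) = 57.60 \times \sqrt{7.70849} = 159.9 . \]

跨年龄组拟合线性趋势的计算见表9.17。线性对比的 t 检验结果为 \(t = 652.0 / 159.9 = 4.08\),自由度为265。
The calculations for fitting a linear trend across age groups are shown in Table 9.17. The t test for the linear contrast gives \(t = 652.0 / 159.9 = 4.08\) on 265 degrees of freedom.

另一种方法是,\(L\) 的平方和为 \(652.0^2 / 7.70849 = 55147\)。它显示在表 9.14 中。F 检验与上述 t 检验完全等价:F 值(16.6)等于 t 值(4.08)的平方,这一点证明了两者的等价性。
Alternatively, the sum of squares for \(L\) is \(652.0^2 / 7.70849 = 55147\). It is shown in Table 9.14. The F test is exactly equivalent to the t test above, as is shown by the value of F (16.6) being equal to the square of the value of t (4.08).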

表 9.17 计算表 9.12 中胰蛋白酶数据线性趋势的平方和
Table 9.17 Calculating the sum of squares for linear trend in serum trypsin data from Table 9.12

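| 组别 | 样本量 \(n_i\) | \(\bar{y}_i\) | \(l_i\) | \(l_i \bar{y}_i\) | \(l_i^2 / n_i\) |
|---|---|---|---|---|---|
| 1 | 32 | 128 | -5 | -640.0 | 0.78125 |
| 2 | 137 | 152 | -3 | -456.0 | 0.06569 |
| 3 | 38 | 194 | -1 | -194.0 | 0.02632 |
| 4 | 44 | 207 | 1 | 207.0 | 0.02273 |
| 5 | 16 | 215 | 3 | 645.0 | 0.56250 |
| 6 | 4 | 218 | 5 | 1090.0 | 6.25000 |
| 合计 | | | | 652.0 | 7.70849 |

| Group | \(n_i\) | \(\bar{y}_i\) | \(l_i\) | \(l_i \bar{y}_i\) | \(l_i^2 / n_i\) |
|---|---|---|---|---|---|
| 1 | 32 | 128 | -5 | -640.0 | 0.78125 |
| 2 | 137 | 152 | -3 | -456.0 | 0.06569 |
| 3 | 38 | 194 | -1 | -194.0 | 0.02632 |
| 4 | 44 | 207 | 1 | 207.0 | 0.02273 |
| 5 | 16 | 215 | 3 | 645.0 | 0.56250 |
| 6 | 4 | 218 | 5 | 1090.0 | 6.25000 |
| Total | | | | 652.0 | 7.70849 |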
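The columns of Table 9.17 and the resulting tests can be reproduced with the short sketch below. It takes the group sizes and means from Table 9.12 and the within groups mean square from Table 9.13 as given; the variable names are illustrative.

```python
import math

# Group sizes and means (Table 9.12) with equally spaced scores summing to zero.
n = [32, 137, 38, 44, 16, 4]
mean = [128, 152, 194, 207, 215, 218]
score = [-5, -3, -1, 1, 3, 5]
ms_within = 3318.0                     # within groups mean square (Table 9.13)

s_res = math.sqrt(ms_within)           # residual standard deviation

# Linear contrast L and the sum of l_i^2 / n_i (the two totals in Table 9.17).
L = sum(l * m for l, m in zip(score, mean))
sum_l2_over_n = sum(l ** 2 / ni for l, ni in zip(score, n))

se_L = s_res * math.sqrt(sum_l2_over_n)
t = L / se_L                           # compared with t on N - k = 265 df

ss_linear = L ** 2 / sum_l2_over_n     # sum of squares due to linear trend (1 df)
F = ss_linear / ms_within

print(f"L = {L:.1f}, sum l^2/n = {sum_l2_over_n:.5f}, se(L) = {se_L:.1f}")
print(f"t = {t:.2f}, SS(linear) = {ss_linear:.0f}, F = {F:.1f}")
```

Because \(F = t^2\) by construction, the two tests necessarily give the same P value.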

9.10 结果展示 9.10 PRESENTATION OF RESULTS

仅仅以 P 值,甚至以检验统计量和 P 值来呈现统计分析结果是远远不够的。应当引用一些实际的结果。本章关注的是连续数据的分析,对于这类数据应该给出均值或中位数,以及数据变异性的某种度量。
It is never sufficient to present the results of a statistical analysis solely as a P value, or even as a test statistic and P value. Some actual results should be quoted. This chapter has been concerned with continuous data, for which means or medians should be given, along with some measure of variability of the data.

如果使用了 t 检验或方差分析,则应给出每组数据的标准差。然而,如果使用配对 t 检验,则应报告组间差异的标准差。对于单因素方差分析,不必一定呈现方差分析表,但这可能有帮助。引用残差标准差是有价值的。
If a t test or analysis of variance has been used then the standard deviation of the data in each group should be given. However, if a paired t test is used the standard deviation of the differences between groups should be quoted. For one way analysis of variance it is not necessary to present the analysis of variance table, but it may be helpful. It is valuable to quote the residual standard deviation.

此外,构建一个或多个均值或均值差的置信区间可能很有用。置信区间优于仅报告标准误,因为标准误本身帮助不大(参见第8章)。
In addition it may be useful to construct one or more confidence intervals for means or differences between means. Confidence intervals are preferable to quoting standard errors, which are not very helpful as they stand (see Chapter 8).

对于用非参数方法分析的数据,如果未显示原始数据,则应给出每组的中位数和选定分位数(例如第10和第90百分位)。对于小样本,可以给出中位数和范围。对于所有分析,最好报告检验统计量以及由此得出的 P 值。自由度应始终明确。
For data analysed by a non-parametric method the median and selected centiles (e.g. 10th and 90th) should be given for each group if the raw data are not shown. For small samples the median and range can be given. For all analyses, it is good practice to quote the test statistic as well as the P value derived from it. It should always be clear what the degrees of freedom are.

图形展示通常采用均值和标准差或标准误“误差条”,但如果可能,展示原始数据更具信息量。图3.14展示了Lind等人(1984年)的一些数据,其中显示了所有原始数据和汇总统计。图9.6展示了均值和标准误本身信息量相对较少。对于偏态分布的数据,信息损失尤为明显。以均值 ± 标准差的形式展示,暗示数据具有对称性,而这可能并不存在。图9.7显示了四组各25个观测值,均值相同(30),标准差相同(5.9)。如果可能,最好在图中展示所有数据,并在文本或表格中给出相关的均值和标准误或置信区间。
Graphical presentation is often by means and standard deviation or standard error 'bars', but it is much more informative to show the raw data where possible. Figure 3.14 showed some data of Lind et al. (1984), in which all the raw data and summary statistics are shown. Figure 9.6 shows how comparatively uninformative the means and standard errors are on their own. For data which have a skewed distribution the loss of information is particularly marked. The presentation as, say, mean ± SD implies a symmetry in the data that may not exist. Figure 9.7 shows four groups of 25 observations that all have the same mean (30) and standard deviation (5.9). Where possible it is valuable to show all the data in a figure, with relevant means and standard errors or confidence intervals given in the text or a table.

图9.6 图3.14仅以均值和标准误条展示(数据来源:Lind等,1984年)。
Figure 9.6 Figure 3.14 shown as means and standard error bars only (data from Lind et al., 1984).

图9.7 四组各25个观测值,均值为30,标准差为5.9。
Figure 9.7 Four groups of 25 observations each having a mean of 30 and standard deviation of 5.9.

9.11 小结 9.11 SUMMARY

本章介绍了分析来自一个、两个或多个独立组个体的连续数据,以及两组配对观测值的各种方法。这些方法涵盖了连续数据分析中很大一部分实际问题。然而,许多情况需要更复杂的分析。例如,当每个个体有两个或多个分类变量时(第12章讨论),或当我们关注两个或多个连续变量之间的关系时(第11章)。
This chapter has described various methods for analysing continuous data from one, two, or several independent groups of individuals, and also two sets of paired observations. These methods cover a large proportion of practical problems in the analysis of continuous data. However, there are many circumstances which require a more complicated analysis. For example, when there are two or more classifications for each individual (considered in Chapter 12), or where we are interested in relations between two or more continuous variables (Chapter 11).

我强调了分析方法依赖于对数据的基本假设。我们可以对不满足假设的数据进行任何分析,但结果将无法解释。例如,计算出的两个均值差异的 95% 置信区间实际上并不是一个 95% 的置信区间,而是一个具有某种未知置信水平的区间。同样,如果数据不满足假设,两样本 t 检验的 P 值也会以未知的程度不准确。
I have emphasized the dependence of the methods of analysis on the underlying assumptions about the data. We can carry out any analysis on data that do not meet the assumptions, but the results would not be interpretable. For example, the calculated 95% confidence interval for the difference between two means would not in fact be a 95% confidence interval but an interval with some other, unknown level of confidence. Likewise, the P value associated with a two sample t test will be wrong to an unknown degree if the data do not meet the assumptions.

数据偏离假设(例如正态分布)而对结果有效性影响很小的程度尚不明确,无法给出任何通用规则。当然,没有任何样本数据具有完全的正态分布;假设的前提是样本来自一个正态分布的总体。来自正态分布的样本,尤其是小样本,可能看起来与期望的对称分布很不一样。虽然存在检验假设的正式方法,但这更多依赖经验判断什么是合理的。通常我们会对一组数据进行参数或非参数检验,而不会同时进行。然而,当对参数方法的假设有效性存在疑问时,有时也会进行非参数分析。如果假设成立,两种方法应给出非常相似的结果;如果结果不同(这又是主观判断),则非参数方法可能更可靠。
The extent to which data may depart from the assumptions of, for example, having a Normal distribution, with minimal effect on the validity of the results is unclear; it is not possible to give any general rule. Of course no sample of data has an exactly Normal distribution; the assumption is not that it does, but rather that the sample comes from a population which does. Samples from Normal distributions, especially small ones, may look quite unlike the expected symmetric distribution. Although formal methods exist for testing assumptions, this is an area where experience gives a feel for what is or is not reasonable. We would usually carry out either a parametric or a non-parametric test of a set of data, not both. However, sometimes when there are doubts about the validity of the assumptions for the parametric method, a non-parametric analysis is carried out too. If the assumptions are met the two methods should give very similar answers, so if the answers differ (again this is a subjective assessment) then the non-parametric method is likely to be the more reliable.

In summary, both parametric and non-parametric methods can be used for continuous data, and the alternative approaches have been described. If the assumptions are met I favour the use of parametric methods because they are more amenable to estimation and confidence intervals, and also because they are readily extended to the more complicated data structures described in later chapters. With a few exceptions, non-parametric methods do not extend to more complex situations.

EXERCISES

9.1 A study was made of all 26 astronauts on the first eight space shuttle flights (Bungo et al., 1985). On a voluntary basis 17 astronauts consumed large quantities of salt and fluid prior to landing as a countermeasure to space deconditioning, while nine did not. The table below shows supine heart rates (beats/minute) before and after flights in the space shuttle.

(a) Compare the pre- and post-flight measurements in the countermeasure group using both a parametric and a non-parametric method. Which analysis is preferable?

(b) In the light of the answer to (a), perform a suitable analysis to compare the changes in heart rate in the two groups. What conclusion can be made about the effectiveness of the countermeasure?

(c) Two astronauts each flew on two missions and are thus represented twice in the data set. Does this matter?

(d) Comment on the voluntary aspect of the study, and how it might affect the interpretation of the results.

| Countermeasure taken: Pre | Post | Change | Countermeasure not taken: Pre | Post | Change |
|---|---|---|---|---|---|
| 71 | 61 | -10 | 61 | 61 | 0 |
| 65 | 59 | -6 | 59 | 66 | 7 |
| 52 | 47 | -5 | 52 | 61 | 9 |
| 68 | 65 | -3 | 54 | 68 | 14 |
| 69 | 69 | 0 | 53 | 77 | 24 |
| 49 | 50 | 1 | 78 | 103 | 25 |
| 49 | 51 | 2 | 52 | 77 | 25 |
| 57 | 60 | 3 | 54 | 80 | 26 |
| 51 | 57 | 6 | 52 | 79 | 27 |
| 55 | 64 | 9 | | | |
| 58 | 67 | 9 | | | |
| 57 | 69 | 12 | | | |
| 59 | 72 | 13 | | | |
| 53 | 69 | 16 | | | |
| 53 | 72 | 19 | | | |
| 53 | 75 | 22 | | | |
| 48 | 77 | 29 | | | |
| Mean 56.88 | 63.76 | 6.88 | 57.22 | 74.67 | 17.44 |
| SD 7.30 | 8.86 | 10.70 | 8.44 | 13.01 | 10.11 |

9.2 The table below shows concentrations of antibody to Type III Group B Streptococcus (GBS) in 20 volunteers before and after immunization (Baker et al., 1980).

Antibody concentration to Type III GBS

| Subject | Before immunization | 4 weeks after immunization |
|---|---|---|
| 1 | 0.4 | 0.4 |
| 2 | 0.4 | 0.5 |
| 3 | 0.4 | 0.5 |
| 4 | 0.4 | 0.9 |
| 5 | 0.5 | 0.5 |
| 6 | 0.5 | 0.5 |
| 7 | 0.5 | 0.5 |
| 8 | 0.5 | 0.5 |
| 9 | 0.5 | 0.5 |
| 10 | 0.6 | 0.6 |
| 11 | 0.6 | 12.2 |
| 12 | 0.7 | 1.1 |
| 13 | 0.7 | 1.2 |
| 14 | 0.8 | 0.8 |
| 15 | 0.9 | 1.2 |
| 16 | 0.9 | 1.9 |
| 17 | 1.0 | 0.9 |
| 18 | 1.0 | 2.0 |
| 19 | 1.6 | 8.1 |
| 20 | 2.0 | 3.7 |

(a) The comparison of the antibody levels was summarized in the report of this study as . . Comment on this result.

(b) What method would be more appropriate to analyse these data? Analyse the data with the appropriate method and comment on the result.

9.3 Using the data in Table 9.7 calculate a confidence interval for the comparison of numbers of cells in Hodgkin's disease and non-Hodgkin's disease patients.

9.4 Patients receiving chemotherapy as outpatients were randomized to receive either an active antiemetic treatment or placebo (Williams et al., 1989). The following table shows measurements (in mm) on a linear analogue self-assessment scale for nausea.

Treatment group

| Active (n = 20) | Placebo (n = 20) |
|---|---|
| 0 | 0 |
| 0 | 10 |
| 0 | 12 |
| 0 | 15 |
| 0 | 15 |
| 2 | 30 |
| 7 | 35 |
| 8 | 38 |
| 10 | 42 |
| 13 | 45 |
| 15 | 50 |
| 18 | 50 |
| 20 | 60 |
| 20 | 64 |
| 21 | 68 |
| 22 | 71 |
| 23 | 74 |
| 30 | 82 |
| 52 | 86 |
| 76 | 95 |

Identify and carry out an appropriate analysis to compare the values in the two groups.

9.5 Urinary cotinine excretion was measured in nonsmokers who lived with smokers. The following table shows the summary of findings given in the paper (Matsukura et al., 1984).

| Cigarettes smoked per day by smoker | n | Urinary cotinine excretion (μg/mg of creatinine), mean (se) |
|---|---|---|
| 1-9 | 25 | 0.31 (0.08) |
| 10-19 | 57 | 0.42 (0.10) |
| 20-29 | 99 | 0.87 (0.19) |
| 30-39 | 38 | 1.03 (0.25) |
| > 40 | 28 | 1.56 (0.57) |
| Unspecified | 25 | 0.56 (0.16) |

(a) What can you say about the shape of the distribution of urinary cotinine in these nonsmokers?

(b) What would be an appropriate analysis to see if there was a systematic relation between number of cigarettes and urinary cotinine levels?

(c) Does it matter that the standard errors vary greatly among the groups?

(d) The authors used multiple tests to compare pairs of groups with a correction for multiple testing. Comment on the appropriateness of their analysis.

9.6 Patients with chronic renal failure undergoing haemodialysis were divided into groups with low or normal plasma heparin cofactor II (HC II) levels (Toulon et al., 1987). Five months later the acute effects of haemodialysis were studied by analysing plasma samples taken before and after haemodialysis. As dialysis increases total protein concentration in plasma, the ratio of HC II to protein was calculated, with the results shown in the following table:

| Group 1 (low): before | after | Group 2 (normal): before | after |
|---|---|---|---|
| 1.41 | 1.47 | 2.11 | 2.15 |
| 1.37 | 1.45 | 1.85 | 2.11 |
| 1.33 | 1.50 | 1.82 | 1.93 |
| 1.13 | 1.25 | 1.75 | 1.83 |
| 1.09 | 1.01 | 1.54 | 1.90 |
| 1.03 | 1.14 | 1.52 | 1.56 |
| 0.89 | 0.98 | 1.49 | 1.44 |
| 0.86 | 0.89 | 1.44 | 1.43 |
| 0.75 | 0.95 | 1.38 | 1.28 |
| 0.75 | 0.83 | 1.30 | 1.30 |
| 0.70 | 0.75 | 1.20 | 1.21 |
| 0.69 | 0.71 | 1.19 | 1.30 |

The data were analysed by separate paired Wilcoxon tests on the data for each group, giving for group 1 and for group 2. Why is it wrong to conclude, as the authors did, that HC II activity increased in group 1 but not in group 2? Carry out a better analysis of these data.

9.7 The effect of gestrinone on patients with asymptomatic endometriosis was evaluated in a randomized double-blind placebo-controlled trial (Thomas and Cooke, 1987). Before treatment each patient was given a score on a scale derived by the American Fertility Society, and this was repeated after 24 weeks' treatment, with the following results (high scores indicate more serious disease):

(a) Identify and carry out a suitable comparison of the scores after treatment.

(b) The pre-treatment scores were somewhat different in the two groups, with five of the six highest scores being in the placebo group. A simple way to allow for this difference is to consider the change in scores over the period of the trial. Identify and carry out a suitable comparison of the changes in scores in the two groups.

Placebo groupGestrinone group
Before treatmentAfter treatmentBefore treatmentAfter treatment
11011
21121
31231
42041
52051
62261
72371
83382
93592
1035102
1135112
1239122
1351132
1455143
1564153
16610163
17612174
185

9.8 What test could be used to compare the SI values in the two groups of patients shown in Exercise 3.1?

10 Comparing groups - categorical data

10.1 INTRODUCTION

Categorical data are very common in medical research, arising when individuals are categorized into one of two or more mutually exclusive groups. In a sample of individuals the number falling into a particular group is called the frequency, so the analysis of categorical data is the analysis of frequencies. When two or more groups are compared the data are often shown in the form of a frequency table. Table 10.1 shows an example of a frequency table - these data will be used to illustrate one form of analysis later in the chapter. A frequency table can also be considered as a cross-tabulation of two categorical variables, either or both of which can be ordinal.

When there are only two categories for one of the variables, for example whether a patient has a particular symptom or not, the data can be summarized as the proportion of the total number of individuals in one of the categories. The data in Table 10.1 can be expressed as the proportion of women having a Caesarean section in each of the six shoe size groups. For this type of data I shall describe the analysis of categorical data expressed either as proportions or as frequency tables. As the analyses relate to alternative ways of expressing the same information, the two methods yield the same answers. Both are described because they are in common use. The frequency table approach is more common, but the comparison of proportions is preferable because it readily yields estimates and confidence intervals. For larger tables where both variables have at least three categories there is no simple alternative, and we use methods suitable for analysing frequency tables.

Table 10.1 Relation between frequency of Caesarean section and maternal shoe size

| Caesarean section | Shoe size: < 4 | 4 | 4½ | 5 | 5½ | 6+ | Total |
|---|---|---|---|---|---|---|---|
| Yes | 5 | 7 | 6 | 7 | 8 | 10 | 43 |
| No | 17 | 28 | 36 | 41 | 46 | 140 | 308 |
| Total | 22 | 35 | 42 | 48 | 54 | 150 | 351 |
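As a small illustration of expressing a frequency table as proportions, the following sketch computes the proportion of women having a Caesarean section in each shoe size group of Table 10.1. The counts are those in the table; the variable names are chosen here for illustration only.

```python
# Counts from Table 10.1 (Caesarean section by maternal shoe size).
shoe_sizes = ["<4", "4", "4.5", "5", "5.5", "6+"]
caesarean_yes = [5, 7, 6, 7, 8, 10]
caesarean_no = [17, 28, 36, 41, 46, 140]

# Proportion with a Caesarean section in each shoe size group.
proportions = [
    yes / (yes + no) for yes, no in zip(caesarean_yes, caesarean_no)
]

for size, yes, no, p in zip(shoe_sizes, caesarean_yes, caesarean_no, proportions):
    print(f"shoe size {size:>3}: {yes}/{yes + no} = {p:.3f}")

overall = sum(caesarean_yes) / (sum(caesarean_yes) + sum(caesarean_no))
print(f"overall: {overall:.3f}")
```

The same information is carried by the frequencies and by the proportions; only the presentation differs.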

Throughout the chapter, except where explicitly stated otherwise, it is assumed that there is only one observation per individual - that is, we have independent observations.

10.2 ONE PROPORTION

The simplest case to consider is when we have a single group of individuals, and have observed that a certain proportion have a particular characteristic. What can we say about the proportion with that characteristic in the population?

10.2.1 Confidence interval

Suppose a general practitioner chooses a random sample of 215 women from the patient register for her general practice, and finds that 39 of them have a history of suffering from asthma. I shall use $r$ to denote the number of cases with the characteristic out of a sample size of $n$, and $p = r/n$ as the proportion of cases, so $p = 39/215 = 0.18$ in this example. As described in Chapter 8, the relevant sampling distribution for a proportion is the Binomial distribution. However, we can usually use the Normal approximation to the Binomial distribution to obtain the standard error of the observed proportion, and so can obtain a confidence interval for the proportion in the population. It is reasonable to use the Normal approximation when both $np$ and $n(1-p)$ exceed 5; in other words, both $r$ and $n - r$ should exceed 5. This will usually be the case.

As we saw in section 8.4.3, the standard error of a proportion $p$ is $\sqrt{p(1-p)/n}$. So the standard error of the observed proportion of women with asthma is $\sqrt{0.18 \times (1 - 0.18)/215} = 0.0262$. The 95% confidence interval for the proportion of women with asthma in the population is thus from

$$0.18 - 1.96 \times 0.0262 \quad \text{to} \quad 0.18 + 1.96 \times 0.0262$$

that is from 0.13 to 0.23. If we can assume that the women in this general practice are representative of all women in the country then we can be reasonably sure on the basis of this sample that the national prevalence of asthma in women is between 13 and 23%.
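The interval can be checked numerically. The sketch below uses the Normal approximation described above with the asthma counts from the text (39 of 215); the function name `proportion_ci` is made up for this example.

```python
from math import sqrt


def proportion_ci(r, n, z=1.96):
    """Return (p, se, lower, upper) using the Normal approximation.

    r is the number with the characteristic, n the sample size, and
    z the Normal deviate for the required confidence level.
    """
    p = r / n
    se = sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se


p, se, lower, upper = proportion_ci(39, 215)
print(f"p = {p:.2f}, se = {se:.4f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the code carries the unrounded proportion, the standard error may differ slightly in the last digit from a hand calculation that starts from 0.18.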

10.2.2 Hypothesis test

We can test the null hypothesis that the population proportion is some pre-specified value. To do this we use the general test statistic given in section 8.5, namely

$$z = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error of observed value}}$$

which will have an approximately Normal distribution under the null hypothesis (with the same sample size requirement as in the previous section). We thus calculate

$$z = \frac{p - p_{\text{exp}}}{se(p)}$$

where $p_{\text{exp}}$ is the pre-specified or 'expected' proportion. Note that because we are testing the null hypothesis, we use the standard error of the proportion expected if the null hypothesis is true. In other words, we have

$$se(p) = \sqrt{\frac{p_{\text{exp}}(1 - p_{\text{exp}})}{n}}$$

which will be slightly different from the standard error used to obtain a confidence interval. If we wish to test the pre- specified hypothesis that the national prevalence of asthma in women is , we calculate

and so

which, from Table B2, corresponds to . We cannot reject the null hypothesis that the prevalence of asthma in women is , and use the confidence interval given above to give a range likely to include the true prevalence.

10.2.3 Continuity correction

The method just described uses the continuous Normal distribution as an approximation to the discrete Binomial distribution. Figure 10.1 shows these two distributions for the example just examined, with $n = 215$. The hypothesis test is based on calculating the tail area of the Normal distribution beyond the observed value, here 39. The Normal distribution corresponds better to the Binomial distribution when we make a small correction of $\frac{1}{2}$ to the observed frequency to allow for the fact that the variable can only take integer values.


Figure 10.1 Binomial distribution with and with the approximating Normal distribution.

The test statistic with the continuity correction is

$$z_c = \frac{|p - p_{\text{exp}}| - \frac{1}{2n}}{se(p)}$$

where $p_{\text{exp}}$ is the proportion expected under the null hypothesis, the symbols $|\;|$ indicate that the sign of the difference between the proportions is ignored, and $se(p)$ is unchanged. The continuity correction thus consists of reducing the difference between the observed and expected proportions. Clearly the effect of the correction diminishes as the sample size increases.

For the asthma data, the test statistic with the continuity correction is

which is only slightly lower than before because the sample size is quite large.
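The one-sample test, with and without the continuity correction, can be sketched as follows. The counts (39 of 215) are from the text. The null value `p_exp = 0.15` used in the usage lines is an illustrative assumption for this sketch, not a value quoted from the text, and the function name is hypothetical. The function returns the absolute value of the test statistic and a two-sided P value.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(r, n, p_exp, continuity=True):
    """Normal-approximation test of H0: population proportion = p_exp.

    Returns (|z|, two-sided P). The standard error is computed under
    the null hypothesis, i.e. from p_exp rather than from r/n.
    """
    p = r / n
    se_null = sqrt(p_exp * (1 - p_exp) / n)
    diff = abs(p - p_exp)
    if continuity:
        # Shrink the observed difference by 1/(2n), never below zero.
        diff = max(diff - 1 / (2 * n), 0.0)
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(z))
    return z, p_value


# p_exp = 0.15 is an assumed null value for illustration only.
z_plain, p_plain = one_proportion_z(39, 215, p_exp=0.15, continuity=False)
z_corr, p_corr = one_proportion_z(39, 215, p_exp=0.15, continuity=True)
print(f"uncorrected: z = {z_plain:.2f}, P = {p_plain:.2f}")
print(f"corrected:   z = {z_corr:.2f}, P = {p_corr:.2f}")
```

With a sample of this size the correction changes the statistic only slightly, which is the behaviour the section describes.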

10.3 PROPORTIONS IN TWO INDEPENDENT GROUPS

Probably the most common question in medical research involves the comparison of observed proportions in two independent groups. Such questions can arise in all types of study, whether observational or experimental.

As an example I will consider data from a randomized clinical trial comparing infra-red stimulation (IRS) with a placebo on the pain caused by cervical osteoarthrosis (Lewith and Machin, 1981). The placebo treatment was mock transcutaneous electrical stimulation and the patients were blind to the treatment given. Twenty-six patients were entered into the trial, but one dropped out before the end. Nine of the 12 patients in the IRS group reported an improvement in pain compared with four of the 13 receiving the placebo treatment. The observed proportions improving were thus 0.75 and 0.31, with a difference of 0.44. In order to calculate a confidence interval for the difference in the population or perform a hypothesis test, we need to consider the sampling distribution of the difference between two proportions.

10.3.1 Confidence interval

As shown in section 8.4.4, the standard error of the difference between the observed proportions, $p_1 - p_2$, is given by

$$se(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

The sampling distribution of $p_1 - p_2$ will be approximately Normal as long as the sample size and proportions are not very small. We can thus calculate the 95% confidence interval very simply as

$$p_1 - p_2 - 1.96 \times se(p_1 - p_2) \quad \text{to} \quad p_1 - p_2 + 1.96 \times se(p_1 - p_2)$$

In the example, the difference in observed proportions is

$$p_1 - p_2 = 0.75 - 0.31 = 0.44$$

and the standard error is

$$se(p_1 - p_2) = \sqrt{\frac{0.75 \times 0.25}{12} + \frac{0.31 \times 0.69}{13}} = 0.179$$

The 95% confidence interval for the difference in proportions with pain relief is thus

$$0.44 - 1.96 \times 0.179 \quad \text{to} \quad 0.44 + 1.96 \times 0.179$$

or 0.09 to 0.79.
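A minimal sketch of this calculation, using the trial counts from the text (9 of 12 improving with IRS, 4 of 13 with placebo). The function name `diff_proportions_ci` is invented for this example.

```python
from math import sqrt


def diff_proportions_ci(r1, n1, r2, n2, z=1.96):
    """Confidence interval for p1 - p2 in two independent groups.

    Uses the unpooled standard error, as is appropriate for a
    confidence interval. Returns (difference, se, lower, upper).
    """
    p1, p2 = r1 / n1, r2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, diff - z * se, diff + z * se


diff, se, lower, upper = diff_proportions_ci(9, 12, 4, 13)
print(f"difference = {diff:.2f}, se = {se:.3f}")
print(f"95% CI {lower:.2f} to {upper:.2f}")
```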

10.3.2 Hypothesis test

A similar approach is adopted when performing a hypothesis test to compare two proportions. The standard error of the difference in proportions is again calculated, but because we are evaluating the probability of the data on the assumption that the null hypothesis is true we calculate a slightly different standard error. If the null hypothesis is true, the two samples come from populations having the same true proportion of individuals with the characteristic of interest, say $p$. We do not know $p$ but both $p_1$ and $p_2$ are estimates of $p$. Our best estimate of $p$ is given by calculating the proportion with the characteristic using all the data in the two samples combined, which is

$$p = \frac{r_1 + r_2}{n_1 + n_2}$$

where $r_1$ and $r_2$ are the numbers with the characteristic in samples of size $n_1$ and $n_2$. The standard error of $p_1 - p_2$ under the null hypothesis is thus calculated on the assumption that the proportion in each group is $p$, so that we have

$$se(p_1 - p_2) = \sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

As noted above, this standard error is not quite the same as that calculated in the previous section.

The sampling distribution of $p_1 - p_2$ is Normal, so we calculate a standard Normal deviate, $z$, as

$$z = \frac{p_1 - p_2}{se(p_1 - p_2)}$$

In the example, the difference in observed proportions is $p_1 - p_2 = 0.44$ as before. The two proportions were 9/12 and 4/13, so the pooled estimate of the population proportion under the null hypothesis is

$$p = \frac{9 + 4}{12 + 13} = 0.52$$

and the standard error of the difference in proportions is

$$se(p_1 - p_2) = \sqrt{0.52 \times 0.48 \times \left(\frac{1}{12} + \frac{1}{13}\right)} = 0.20$$

The test statistic is thus $z = 0.44/0.20 = 2.20$, which from Table B2 gives $P = 0.03$. Thus there is evidence of a difference between the treatments. As shown earlier, however, the confidence interval for the difference is wide because the samples are small.

10.3.3 Continuity correction

As with the single sample case, it is advisable to use a continuity correction when comparing two proportions, especially when the samples are small. The effect is to reduce slightly the observed difference between the two proportions. The modified formula for $z$ is

$$z_c = \frac{|p_1 - p_2| - \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{se(p_1 - p_2)}$$

where $se(p_1 - p_2)$ is unchanged. It can be seen that the extra term in the numerator (on the top) is based on a quantity already calculated in the denominator (on the bottom). In our example the continuity corrected test statistic is

$$z_c = \frac{0.44 - \frac{1}{2}\left(\frac{1}{12} + \frac{1}{13}\right)}{0.20} = 1.80$$

which corresponds to $P = 0.07$.

The continuity correction has made quite a large impact on the test statistic because the samples were small. It is clear from the extra term in the formula that the impact of the correction diminishes as the sample sizes increase.

It is advisable to use the continuity correction routinely for both one and two sample tests. Without it results tend to be slightly optimistic, so that the $P$ values are too small. In the example, the use of the correction gives a rather larger $P$ value which is now above the 5% level. We can still report that there is evidence to suggest a difference in effectiveness of the two treatments, but it is not as strong as was suggested by the uncorrected analysis.

Because the standard error used for calculating the confidence interval differs from that used in the hypothesis test it can occasionally happen, as here, that the confidence interval excludes the value specified under the null hypothesis when the hypothesis test gives a non-significant result. The difference in interpretation will not be important. Note that no continuity correction is necessary for constructing a confidence interval as we are not calculating probabilities based on the tail area of a distribution.
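The two-sample test with the pooled standard error, with and without the continuity correction, can be sketched as below for the IRS trial counts in the text (9 of 12 versus 4 of 13). The function name is hypothetical, and the function returns the absolute value of the test statistic with a two-sided P value. Because the code works with unrounded proportions, the statistics can differ in the last digit from a hand calculation based on rounded values.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(r1, n1, r2, n2, continuity=True):
    """Normal-approximation test of H0: p1 = p2 (independent groups).

    Returns (|z|, two-sided P), using the pooled standard error.
    """
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(p1 - p2)
    if continuity:
        # Subtract half of (1/n1 + 1/n2), never going below zero.
        diff = max(diff - 0.5 * (1 / n1 + 1 / n2), 0.0)
    z = diff / se_null
    return z, 2 * (1 - NormalDist().cdf(z))


z_plain, p_plain = two_proportion_z(9, 12, 4, 13, continuity=False)
z_corr, p_corr = two_proportion_z(9, 12, 4, 13, continuity=True)
print(f"uncorrected: z = {z_plain:.2f}, P = {p_plain:.2f}")
print(f"corrected:   z = {z_corr:.2f}, P = {p_corr:.2f}")
```

The corrected P value moves from below to above 0.05, in line with the discussion of these small samples.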

10.4 TWO PAIRED PROPORTIONS

There are several circumstances in which we may observe two proportions on the same individuals. We may wish to compare the pain relief by two different analgesics in the same subjects or to compare the proportion of subjects with a particular symptom before and after treatment. A statistically identical problem arises when we wish to compare one characteristic in two pair-matched groups.

As an example, Karacan et al. (1976) compared a group of 32 marijuana users with 32 matched controls with respect to their sleeping difficulties. Seven of the marijuana users reported sleep difficulties sometimes or always compared with 13 of the controls. Because the groups were individually matched we should not treat the observations as independent and thus need different methods from those described in the previous section. We will see that we cannot perform the appropriate analyses if we know only the two proportions.

10.4.1 Confidence interval

We want to calculate a confidence interval for the difference between two proportions $p_1$ and $p_2$ where the two groups of observations are not independent. The standard error of the difference is not, therefore, based simply on the variances of each proportion but must take account of the paired results in some way.

We can divide the paired observations into four groups, according to whether the characteristic is present or not in each member of the pair, as shown in Table 10.2. The two proportions we wish to compare are $p_1 = (a + b)/n$ and $p_2 = (a + c)/n$. These proportions are not independent as they both contain $a$, the number of Yes-Yes pairs. The difference in proportions is, however, given by

$$p_1 - p_2 = \frac{a + b}{n} - \frac{a + c}{n} = \frac{b - c}{n}$$

so that the number $a$ disappears, which is rather surprising. Nevertheless, we are still comparing non-independent proportions. The standard error of the difference in proportions is given by

$$se(p_1 - p_2) = \frac{1}{n}\sqrt{b + c - \frac{(b - c)^2}{n}}$$

(The derivation of this formula will not be given here.) The 95% confidence interval for $p_1 - p_2$ is thus obtained as

$$p_1 - p_2 - 1.96 \times se(p_1 - p_2) \quad \text{to} \quad p_1 - p_2 + 1.96 \times se(p_1 - p_2)$$

Table 10.2 Frequency of each combination of paired characteristics

| Observation 1 | Observation 2 | Number of pairs |
|---|---|---|
| Yes | Yes | a |
| Yes | No | b |
| No | Yes | c |
| No | No | d |
| Total | | n |

In our example, we need to know the values $a$, $b$, $c$ and $d$, which are shown in Table 10.3. We have $p_1 = 7/32 = 0.22$ and $p_2 = 13/32 = 0.41$, so the observed difference in proportions, taken as controls minus marijuana users, is

$$p_2 - p_1 = 0.41 - 0.22 = 0.19$$

and its standard error is

$$\frac{1}{32}\sqrt{3 + 9 - \frac{(3 - 9)^2}{32}} = 0.103$$

Table 10.3 Numbers of marijuana users and matched controls reporting sleeping difficulties (Karacan et al., 1976)

| Sleep difficulties: Marijuana group | Control group | Number of pairs |
|---|---|---|
| Yes | Yes | a = 4 |
| Yes | No | b = 3 |
| No | Yes | c = 9 |
| No | No | d = 16 |
| Total | | n = 32 |

So the 95% confidence interval for the difference in the proportions experiencing sleep difficulties is

$$0.19 - 1.96 \times 0.103 \quad \text{to} \quad 0.19 + 1.96 \times 0.103$$

or -0.01 to 0.39. There is thus some weak evidence that marijuana users experience fewer sleeping difficulties than controls, but the confidence interval for the difference is very wide.
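The paired confidence interval can be sketched as follows, using the counts in Table 10.3. The function name is invented for this example. As written, the function returns the difference (observation 1 minus observation 2); to reproduce an interval for controls minus marijuana users, the usage lines pass the control group as observation 1, which simply swaps the roles of `b` and `c`.

```python
from math import sqrt


def paired_diff_ci(a, b, c, d, z=1.96):
    """Confidence interval for p1 - p2 from paired Yes/No data.

    a, b, c, d are the numbers of Yes-Yes, Yes-No, No-Yes and No-No
    pairs (observation 1 first). Returns (difference, se, lower, upper).
    """
    n = a + b + c + d
    diff = (b - c) / n
    se = sqrt(b + c - (b - c) ** 2 / n) / n
    return diff, se, diff - z * se, diff + z * se


# Table 10.3 has b = 3 (user Yes, control No) and c = 9 (user No,
# control Yes). Treating controls as observation 1 swaps b and c.
diff, se, lower, upper = paired_diff_ci(a=4, b=9, c=3, d=16)
print(f"difference = {diff:.2f}, se = {se:.3f}")
print(f"95% CI {lower:.2f} to {upper:.2f}")
```

Swapping the two discordant counts changes only the sign of the difference and of the interval limits, not the width of the interval.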

10.4.2 Hypothesis test

We can also perform a significance test of the null hypothesis that there is no difference between the paired proportions. As with two independent samples, we need to evaluate the standard error of the difference on the assumption that the null hypothesis is true, which means that we replace both $b$ and $c$ by their average, $(b + c)/2$. The formula for the standard error given in the previous section thus simplifies to

$$se(p_1 - p_2) = \frac{1}{n}\sqrt{b + c}$$

and we calculate our test statistic as

$$z = \frac{p_1 - p_2}{se(p_1 - p_2)} = \frac{b - c}{\sqrt{b + c}}$$

which is one of the simplest formulae in statistics. An alternative derivation of this formula is given in section 10.4.4.

In the example we get

$$z = \frac{3 - 9}{\sqrt{3 + 9}} = -1.73$$

giving $P = 0.08$. We cannot reject the null hypothesis at the 5% level. Note that it does not matter whether we take $b - c$ or $c - b$ in the equation, as $z = +1.73$ would give the same two-sided $P$ value.

10.4.3 Continuity correction

We ought to use a continuity correction when comparing paired proportions, especially in small samples. As with the unpaired case we use the formula

$$z_c = \frac{|p_1 - p_2| - \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{se(p_1 - p_2)}$$

but here the two samples are the same size, so we get

$$z_c = \frac{\frac{|b - c|}{n} - \frac{1}{n}}{\frac{1}{n}\sqrt{b + c}} = \frac{|b - c| - 1}{\sqrt{b + c}}$$

In other words, to use the continuity correction we subtract 1 from the absolute difference between $b$ and $c$ before dividing by $\sqrt{b + c}$.

In our example we have

$$z_c = \frac{|3 - 9| - 1}{\sqrt{3 + 9}} = 1.44$$

corresponding to $P = 0.15$. As we saw in the previous section, the effect of the continuity correction is quite marked in small samples. Its use will always increase the $P$ value.
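A short sketch of the paired test, with and without the continuity correction, using the discordant counts from Table 10.3 (3 and 9). The function name is hypothetical; it returns the absolute value of the test statistic and the two-sided P value from the Normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def paired_proportions_z(b, c, continuity=True):
    """Test of no difference between paired proportions.

    b and c are the numbers of Yes-No and No-Yes pairs; the
    concordant pairs do not enter the test. Returns (|z|, two-sided P).
    """
    diff = abs(b - c)
    if continuity:
        # Subtract 1 from |b - c|, never going below zero.
        diff = max(diff - 1, 0)
    z = diff / sqrt(b + c)
    return z, 2 * (1 - NormalDist().cdf(z))


z_plain, p_plain = paired_proportions_z(3, 9, continuity=False)
z_corr, p_corr = paired_proportions_z(3, 9, continuity=True)
print(f"uncorrected: z = {z_plain:.2f}, P = {p_plain:.2f}")
print(f"corrected:   z = {z_corr:.2f}, P = {p_corr:.2f}")
```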

10.4.4 An alternative derivation based on the Binomial distribution

As shown above, the hypothesis test for comparing paired proportions is based only on the numbers of pairs showing disagreement, $b$ and $c$. Those showing agreement, $a$ and $d$, do not appear in the formula.

Another way of considering the problem, therefore, is to look at the total number of disagreements, $b + c$. Under the null hypothesis we expect the numbers of 'Yes-No' and 'No-Yes' pairs to be the same so we can evaluate the probability of observing $b$ out of $b + c$ to be in one of these groups (or, equivalently, $c$ out of $b + c$). The number $b$ will follow a Binomial distribution with sample size $b + c$ and $p = 0.5$. Because $p$ is 0.5 the Normal approximation to the Binomial distribution is very good even for quite small samples. The standard error of $b$ is

$$\sqrt{(b + c) \times 0.5 \times 0.5} = \frac{\sqrt{b + c}}{2}$$

The statistic $z$ is calculated as

$$z = \frac{b - \frac{b + c}{2}}{\frac{\sqrt{b + c}}{2}} = \frac{b - c}{\sqrt{b + c}}$$

as before. This test is identical to the sign test which was introduced in section 9.4.4. Here the comparison is expressed in terms of the proportions whereas in the earlier description it was in terms of the actual frequencies, but the two are exactly equivalent. We will meet other tests which reduce to a simple Binomial test of a single proportion. When the data are expressed as a frequency table the test is usually called the McNemar test, under which name it is discussed in section 10.7.5.
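The Binomial view also allows the P value to be computed exactly rather than through the Normal approximation. The sketch below is an illustration that goes slightly beyond the text, which uses the Normal approximation: it sums Binomial probabilities with $p = 0.5$ for the discordant pairs of Table 10.3 (3 and 9). The function name is hypothetical, and doubling the smaller tail is one common convention for a two-sided exact P value.

```python
from math import comb


def exact_discordant_pairs_test(b, c):
    """Exact two-sided Binomial (sign) test for paired proportions.

    Under the null hypothesis the number of Yes-No pairs is
    Binomial(b + c, 0.5). The two-sided P value is taken as twice the
    smaller tail probability, capped at 1.
    """
    m = b + c
    k = min(b, c)
    tail = sum(comb(m, i) for i in range(k + 1)) / 2 ** m
    return min(1.0, 2 * tail)


p_exact = exact_discordant_pairs_test(3, 9)
print(f"exact two-sided P = {p_exact:.3f}")
```

For these data the exact value is close to the P value obtained from the Normal approximation with the continuity correction.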

10.4.5 Are a and d really ignored?

All of the formulae for analysing paired proportions seem to be based on only those pairs showing disagreement - 'Yes-No' or 'No-Yes' in Tables 10.2 and 10.3. While it is true that the result of the hypothesis test comparing the two classifications depends only on $b$ and $c$, the confidence interval depends on the sample size $n$ too. We expect the confidence interval and hypothesis testing approaches to give closely corresponding results (with some small discrepancies due to the use of different standard errors) and an example will show that this does indeed happen.

Consider the two sets of data in Table 10.4 showing presence or absence of a symptom before and after treatment. In both tables (i) and (ii) $b = 15$ and $c = 6$, so for both of them a test of the null hypothesis that there is no difference between the two features is given by

$$z = \frac{15 - 6}{\sqrt{15 + 6}} = 1.96, \quad P = 0.05$$

(I shall ignore the continuity correction for this illustrative example.) We would expect the confidence interval for the difference between the two proportions to have one end very close to zero because the $P$ value is almost exactly 0.05 - does this happen regardless of the size of $a$ and $d$?

Table 10.4 Two sets of paired data showing the same numbers of Yes-No and No-Yes pairs

| Presence of symptom: Time 1 | Time 2 | Data set (i) | Data set (ii) |
|---|---|---|---|
| Yes | Yes | a = 10 | a = 51 |
| Yes | No | b = 15 | b = 15 |
| No | Yes | c = 6 | c = 6 |
| No | No | d = 5 | d = 33 |
| Total | | n = 36 | n = 105 |

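The two sets of calculations are shown below in parallel, using the standard error for a difference between paired proportions:

| | (i) | (ii) |
|---|---|---|
| $p_1 = (a + b)/n$ | 25/36 = 0.694 | 66/105 = 0.629 |
| $p_2 = (a + c)/n$ | 16/36 = 0.444 | 57/105 = 0.543 |
| $p_1 - p_2 = (b - c)/n$ | 0.2500 | 0.0857 |
| $se(p_1 - p_2) = \frac{1}{n}\sqrt{b + c - (b - c)^2/n}$ | 0.1203 | 0.0428 |
| 95% CI | $0.2500 \pm 1.96 \times 0.1203$, i.e. 0.014 to 0.486 | $0.0857 \pm 1.96 \times 0.0428$, i.e. 0.002 to 0.170 |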

Both confidence intervals behave as expected, with the lower limit close to zero. The confidence interval for the difference in proportions is much narrower for the larger sample, as we would expect. Note that the difference between data sets (i) and (ii) is the change in $a$ and $d$ and thus $n$, none of which is seen when only testing the hypothesis that $p_1 = p_2$.
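The comparison of the two data sets can be sketched in Python. The function name `paired_summary` is illustrative, and the standard error is assumed to be the usual one for a difference between paired proportions, $\frac{1}{n}\sqrt{b + c - (b - c)^2/n}$; with that assumption the sketch reproduces the intervals quoted above.

```python
import math


def paired_summary(a, b, c, d, z_crit=1.96):
    """Return (z, difference, lower, upper) for paired Yes/No data.

    a = Yes-Yes, b = Yes-No, c = No-Yes, d = No-No pairs.
    """
    n = a + b + c + d
    z = (b - c) / math.sqrt(b + c)                  # the test uses only b and c
    diff = (b - c) / n                              # p1 - p2 depends on n as well
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n    # assumed standard error
    return z, diff, diff - z_crit * se, diff + z_crit * se


data_sets = {"(i)": (10, 15, 6, 5), "(ii)": (51, 15, 6, 33)}
for label, cells in data_sets.items():
    z, diff, lower, upper = paired_summary(*cells)
    print(f"{label}: z = {z:.2f}, difference = {diff:.4f}, "
          f"95% CI {lower:.3f} to {upper:.3f}")
```

The value of $z$ is the same for both data sets, while the interval narrows as $n$ grows.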

10.5 COMPARING SEVERAL PROPORTIONS

When comparing several proportions relating to different groups of subjects two alternative cases must be considered, according to whether the groups are ordered or not. These problems are discussed in section 10.8, as they are more easily considered in the framework of frequency tables.

The comparison of more than two paired proportions is beyond the scope of this book. The analysis is described by Fleiss (1981, p. 126).

10.6 THE ANALYSIS OF FREQUENCY TABLES

Proportions are a way of expressing counts or frequencies when there are only two possible outcomes, such as the presence or absence of a symptom. A more general way of showing frequencies is in a table, where each cell of the table corresponds to a particular combination of characteristics relating to two or more classifications. Here I will deal only with 'two way' tables, which relate to two categorical variables. Frequency tables are sometimes called contingency tables.

There is a single, general approach to the analysis of all frequency tables, but in practice the method of analysis varies according to

  1. the number of categories
  2. whether the categories are ordered or not
  3. the number of independent groups of subjects, and
  4. the nature of the question being asked.

I will first consider the general approach, and then several special cases.

10.6.1 The general case - the $r \times c$ table

An example of a two way frequency table is given in Table 10.5, which shows caffeine consumption by marital status in a sample of 3888 antenatal patients. Although the methods we use to analyse data of this type are based on the observed frequencies, it is easier to see what is going on by expressing the frequencies as percentages of either the row or column totals, especially when there are large variations among the row or column totals. Table 10.6 shows the data from Table 10.5 expressed as row percentages. In this section I shall describe the general approach to frequency tables with $r$ rows and $c$ columns - the $r \times c$ table. Although this method can be used for tables of any size, if either $r$ or $c$ is equal to 2, the method can be simplified (see section 10.7 for $2 \times 2$ tables and section 10.8 for $2 \times k$ tables).

Table 10.5 Caffeine consumption and marital status in antenatal patients (from Martin and Bracken, 1987). Columns give caffeine consumption (mg/day).

| Marital status | 0 | 1-150 | 151-300 | > 300 | Total |
|---|---|---|---|---|---|
| Married | 652 | 1537 | 598 | 242 | 3029 |
| Divorced, separated or widowed | 36 | 46 | 38 | 21 | 141 |
| Single | 218 | 327 | 106 | 67 | 718 |
| Total | 906 | 1910 | 742 | 330 | 3888 |

Table 10.6 Caffeine consumption and marital status data from Table 10.5 expressed as row percentages. Columns give caffeine consumption (mg/day).

| Marital status | 0 | 1-150 | 151-300 | > 300 | Total |
|---|---|---|---|---|---|
| Married | 22% | 51% | 20% | 8% | 3029 (100%) |
| Divorced, separated or widowed | 26% | 33% | 27% | 15% | 141 (100%) |
| Single | 30% | 46% | 15% | 9% | 718 (100%) |
| Total | 23% | 49% | 19% | 8% | 3888 (100%) |
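Row percentages of this kind are simple to compute. The following Python sketch uses the observed counts from Table 10.5; the names `observed` and `row_percentages` are illustrative only.

```python
# Observed counts from Table 10.5; columns are caffeine consumption
# of 0, 1-150, 151-300 and > 300 mg/day.
observed = {
    "Married": [652, 1537, 598, 242],
    "Divorced, separated or widowed": [36, 46, 38, 21],
    "Single": [218, 327, 106, 67],
}


def row_percentages(table):
    """Express each row of counts as percentages of that row's total."""
    return {
        name: [100 * count / sum(row) for count in row]
        for name, row in table.items()
    }


for name, percentages in row_percentages(observed).items():
    print(name, [f"{p:.0f}%" for p in percentages])
```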

The analysis of frequency tables is largely based on hypothesis testing. The null hypothesis is that the two classifications (caffeine consumption and marital status) are unrelated in the relevant population (antenatal patients). We compare the observed frequencies with what we would expect if the null hypothesis were true. We base our calculation of the expected frequencies on the distribution of the variables in the whole sample, as indicated by the row and column totals. The combinations of row and column categories are known as cells.

For reasons that will be explained in section 10.6.4 it turns out that the appropriate test statistic is obtained from the observed and expected frequencies, $O$ and $E$ respectively, by calculating the sum of the quantities $(O - E)^2/E$ for all the cells in the table. The further the observed values are away from the expected values, the less likely it is that the null hypothesis is true. Thus a large value of $\sum (O - E)^2/E$ is evidence that the row and column variables are not independent.

10.6.2 Expected frequencies

If the null hypothesis is true and the two variables are unrelated (i.e. independent) then the probability of an individual being in a particular row is independent of which column they are in. The probability of being in a particular cell of the table is thus simply the product of the probabilities of being in the row and the column containing that cell. These probabilities are estimated using the observed proportions. For example, there were 3029 married women in the sample of 3888, so that the proportion of married women was 3029/3888. Likewise the proportion of women consuming no caffeine was 906/3888. Thus if marital status and caffeine consumption are independent the expected proportion of the whole sample who are married and consume no caffeine is the product of these proportions:

$$\frac{3029}{3888} \times \frac{906}{3888} = 0.1815$$

To get the expected frequency in that cell of the table we multiply by the sample size, to get

$$\frac{3029}{3888} \times \frac{906}{3888} \times 3888 = \frac{3029 \times 906}{3888} = 705.8$$

The expected frequency in each cell is thus the product of the relevant row and column totals divided by the sum of all the observed frequencies in the table (i.e. the sample size). Table 10.7 shows the expected frequencies for the whole table. The hypothesis test is based on the difference between the frequencies in Tables 10.5 and 10.7. As explained in section 10.6.4, the appropriate test statistic is obtained by calculating the sum of the quantities $(O - E)^2/E$ for all the cells in the table, where $O$ and $E$ denote the observed and expected frequencies. The test statistic $X^2$ is thus

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $i$ indicates the row number and $j$ the column number. This formula is often written simply as

$$X^2 = \sum \frac{(O - E)^2}{E}$$

Table 10.7 Expected frequencies corresponding to Table 10.5. Columns give caffeine consumption (mg/day).

| Marital status | 0 | 1-150 | 151-300 | > 300 | Total |
|---|---|---|---|---|---|
| Married | 705.8 | 1488.0 | 578.1 | 257.1 | 3029 |
| Divorced, separated or widowed | 32.9 | 69.3 | 26.9 | 12.0 | 141 |
| Single | 167.3 | 352.7 | 137.0 | 60.9 | 718 |
| Total | 906 | 1910 | 742 | 330 | 3888 |
Note that the sum of all the differences $(O - E)$ is zero because the observed and expected frequencies both add up to the sample size. We square the differences before adding them, as we do when calculating the standard deviation of a set of observations around their mean.

When the null hypothesis is true the statistic $X^2$ has a Chi squared distribution; this was briefly introduced in section 9.8.6. For this reason the test is usually called the Chi squared test. The test statistic is often written $\chi^2$, but it is better to call the test statistic $X^2$ to distinguish it from the theoretical $\chi^2$ distribution.
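As a minimal sketch, the expected frequencies and the statistic $X^2$ for the caffeine data can be computed in Python as follows. The function names are illustrative, and the observed counts are those of Table 10.5.

```python
# Observed counts from Table 10.5 (rows: marital status;
# columns: caffeine consumption of 0, 1-150, 151-300, > 300 mg/day).
observed = [
    [652, 1537, 598, 242],   # married
    [36, 46, 38, 21],        # divorced, separated or widowed
    [218, 327, 106, 67],     # single
]


def expected_frequencies(table):
    """Row total x column total / sample size, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum of (O - E)^2 / E over all cells of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_frequencies(observed)
print([round(e, 1) for e in expected[0]])      # expected counts, first row
print(round(chi_squared_statistic(observed), 2))
```

Working from unrounded expected frequencies may give a total that differs in the last decimal place from one built up from rounded cell contributions.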

10.6.3 The Chi squared distribution

The definition of the Chi squared distribution is simple. If we have a quantity (variable) $z$ which has a standard Normal distribution, then $z^2$ has a Chi squared distribution. Clearly $z^2$ can have only positive values, and its distribution is highly skewed. This distribution of $z^2$ has one degree of freedom, and is the simplest case of a more general 'family' of Chi squared distributions. If we have several independent variables, each of which has a standard Normal distribution, say $z_1, z_2, \ldots, z_k$, then the sum of the squares of all the $z$s, $\sum z_i^2$, has a Chi squared distribution with $k$ degrees of freedom. Figure 10.2 shows theoretical Chi squared distributions with different degrees of freedom.

Figure 10.2 Chi squared distributions with different numbers of degrees of freedom.

The Chi squared distribution with one degree of freedom is the square of a standard Normal distribution, so the 5% cut-off point for $\chi^2$ with one degree of freedom is the square of the two-tailed 5% cut-off for the Normal distribution, that is, $1.96^2$ or 3.84. Note that the upper tail of the Chi squared distribution with one degree of freedom corresponds to both tails of the standard Normal distribution. In other words, for a hypothesis test we compare $z^2$ with the $\chi^2$ distribution with one degree of freedom.

The number of degrees of freedom when using the Chi squared test for a two way frequency table is the product $(r - 1)(c - 1)$, where $r$ is the number of rows and $c$ the number of columns. For a $2 \times 2$ table, therefore, we compare our test statistic with the Chi squared distribution with one degree of freedom. Table 10.5 has 3 rows and 4 columns so we must refer to the Chi squared distribution with $2 \times 3 = 6$ degrees of freedom. The expected value of $X^2$ when the null hypothesis is true is the number of degrees of freedom. Because any differences between observed and expected frequencies are squared, non-independence of the row and column variables is indicated by high values of $X^2$. Table B5 gives upper tail areas for Chi squared distributions with different degrees of freedom. It is simple to verify that the entries for one degree of freedom are the squares of the values of the standard Normal distribution in Table B2 that cut off the same two-tailed areas. The next two sections explain why we use the Chi squared distribution for analysing frequency tables, and also why the degrees of freedom are $(r - 1)(c - 1)$.
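The link between the Normal and Chi squared cut-off points can be checked numerically. This Python sketch uses only the standard library; the helper names are illustrative, and the one degree of freedom tail area is computed from the identity $P(z^2 > x) = P(|z| > \sqrt{x})$.

```python
import math
from statistics import NormalDist

# Two-tailed 5% point of the standard Normal distribution and its square.
z_cut = NormalDist().inv_cdf(0.975)
print(round(z_cut, 2), round(z_cut ** 2, 2))


def chi2_upper_tail_1df(x):
    """Upper tail area of Chi squared with 1 df: both Normal tails beyond sqrt(x)."""
    return math.erfc(math.sqrt(x / 2))


def degrees_of_freedom(rows, cols):
    """Degrees of freedom for a two way frequency table."""
    return (rows - 1) * (cols - 1)


print(round(chi2_upper_tail_1df(z_cut ** 2), 3))
print(degrees_of_freedom(3, 4))
```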

10.6.4 Why we use the Chi squared distribution

(This short section is more theoretical although not highly mathematical. It explains the rationale behind the use of the Chi squared distribution, the most common method for analysing frequency tables. It can be omitted without loss of continuity.)

Why is the Chi squared distribution appropriate for the analysis of categorical data? Strangely, the answer to this question involves both the Poisson and Normal distributions. If we observe a number of independent individuals, and categorize them into mutually exclusive groups in relation to two classifications, such as in Table 10.5, then the number in any cell of that table will follow a Poisson distribution if the null hypothesis is true. For the purpose of a hypothesis test we wish to compare the observed number, $O$, in each cell with the number expected, $E$, if the null hypothesis is true. The Poisson distribution can be approximated by a Normal distribution with mean $E$ and standard deviation $\sqrt{E}$, when $E$ is not too small. Thus $(O - E)/\sqrt{E}$ has approximately a standard Normal distribution, and $(O - E)^2/E$ has approximately a Chi squared distribution with one degree of freedom. If we have $k$ independent observed frequencies we can add together the quantities $(O - E)^2/E$ for each to get a Chi squared distribution with $k$ degrees of freedom. When analysing frequency tables not all of the frequencies are independent, however, so we must modify the degrees of freedom.
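The quality of this approximation can be explored numerically. The Python sketch below computes, for a single Poisson count $O$ with mean $E$, the exact probability that $(O - E)^2/E$ exceeds 3.84; if the approximation is good this should be close to 0.05. The function names and the choice of means are illustrative assumptions.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of observing k, computed on the log scale."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def exact_exceedance(mean, cutoff=3.84):
    """Exact P((O - E)^2 / E > cutoff) when O is Poisson with mean E."""
    # Counts more than 20 standard deviations above the mean are ignored;
    # their total probability is negligible.
    upper = int(mean + 20 * math.sqrt(mean)) + 20
    return sum(
        poisson_pmf(k, mean)
        for k in range(upper + 1)
        if (k - mean) ** 2 / mean > cutoff
    )


for mean in (1, 5, 25, 100):
    print(mean, round(exact_exceedance(mean), 3))
```

Because counts are discrete, the exact probability moves in jumps as $E$ changes, which is one reason the approximation is poorer for small $E$.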

10.6.5 Degrees of freedom

As shown in section 10.6.2, the expected frequency in any cell is the product of the relevant row and column totals divided by the total sample size. The expected frequencies are calculated from the observed row and column totals, and so the Chi squared test is 'conditional' on these totals. Because of the use of observed totals the expected frequencies are not all independent. Consider the first row of Table 10.5. The expected frequencies are 705.8, 1488.0, 578.1 and 257.1, as shown in Table 10.7. We know, however, that the sum of the expected frequencies in the first row is the same as the sum of the observed frequencies, that is 3029. Any of the expected values can therefore be obtained if we already know all the others in that row. The same applies to every row. There are thus only $c - 1$ independent columns in the table. Likewise there are only $r - 1$ independent rows, and consequently $(r - 1)(c - 1)$ independent frequencies. The test statistic $X^2$ thus follows the Chi squared distribution with $(r - 1)(c - 1)$ degrees of freedom under the null hypothesis.

We can see the above process very simply in the $2 \times 2$ table, for which all four expected frequencies can be obtained once we have one of them, since the other three follow by subtraction from the row and column totals.

There is thus only one degree of freedom, agreeing with the general formula of $(r - 1)(c - 1)$.

10.6.6 The Chi squared test for an $r \times c$ table

In section 10.6.2 I introduced the test statistic $X^2$ for evaluating the null hypothesis that the categorical variables denoting the rows and columns are independent. We can calculate expected frequencies in each cell of the table on the assumption that the null hypothesis is true, and then calculate $X^2$ as

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $i$ and $j$ indicate the row and column numbers. Table 10.8 shows the contribution of each cell of Table 10.5 to the test statistic, which is $X^2 = 51.66$. From Table B5 the value of the Chi squared distribution with 6 degrees of freedom which cuts off 0.1% in the upper tail is 22.46, so there is a highly significant association ($P < 0.001$) between marital status and caffeine consumption in this sample of women. In section 10.9.1 there is further discussion of this data set that takes account of the fact that one of the variables has ordered categories.

Table 10.8 Contributions of each cell in Table 10.5 to $X^2$. Columns give caffeine consumption (mg/day).

| Marital status | 0 | 1-150 | 151-300 | > 300 | Total |
|---|---|---|---|---|---|
| Married | 4.11 | 1.61 | 0.69 | 0.89 | 7.30 |
| Divorced, separated or widowed | 0.30 | 7.82 | 4.57 | 6.82 | 19.51 |
| Single | 15.36 | 1.88 | 7.02 | 0.60 | 24.86 |
| Total | 19.77 | 11.31 | 12.28 | 8.31 | 51.66 |
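When a computer is available the upper tail area can be calculated rather than read from a table. For an even number of degrees of freedom there is a simple closed form, $P(\chi^2 > x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$ with $m$ equal to half the degrees of freedom. The Python sketch below uses it; the function name is illustrative and the formula applies only to even degrees of freedom.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper tail area of a Chi squared distribution with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form applies only to even, positive df")
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


# Tail area at the tabulated 0.1% point for 6 degrees of freedom.
print(chi2_upper_tail_even_df(22.46, 6))
# Tail area at the observed statistic for the caffeine data.
print(chi2_upper_tail_even_df(51.66, 6))
```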

10.6.7 Interpretation

Many statistical analyses involve evaluation of possible associations between variables, notably the Chi squared test and the equivalent method for relating two continuous variables, correlation (see Chapter 11). It is essential to realize that an observed association does not necessarily indicate a causal relation between variables. We should not infer that marital status influences caffeine consumption, nor indeed that caffeine consumption influences marital status, without external evidence. Very often, as in this example, there will be other factors that influence both variables. Further discussion of the interpretation of association is given in section 11.8.

A different problem is the interpretation of an observed association between two variables each of which has several categories, as in the caffeine example. Just saying that the two variables are associated is often not very informative. We might wish to know, for example, if one of the three marital status groups differs from the other two groups. Here we have a multiple comparison problem comparable to that for continuous variables discussed in section 9.8.4. One way to proceed is to make comparisons between each pair of groups, or if there is some prior hypothesis that one group might differ then that group could be compared with the combined data from the other groups. These procedures are not ideal because they involve some ad hoc or subjective analyses. The further testing of subsets of a large table should only be carried out if the overall analysis shows some evidence of departure from the null hypothesis (perhaps $P < 0.05$) or where some specific prior hypothesis exists. Fortunately, as noted below, we do not often have to deal with this type of analysis. In particular, the analysis of tables with only two rows (or columns) is discussed in sections 10.7 and 10.8.

One last important comment on interpretation is the reminder that the size of $X^2$ (or $P$) does not indicate the strength of the association, but rather the strength of the evidence against the null hypothesis of no association.

10.6.8 Sample size

As described in section 10.6.4, the use of the Chi squared distribution for the test statistic $X^2$ is based on a 'large sample' approximation. In the context of frequency tables there are some fairly clear guidelines on how large the frequencies need to be for the method to be valid. The guidelines, attributed to the statistician W. G. Cochran, are that 80% of the cells in the table should have expected frequencies greater than 5, and all cells should have expected frequencies greater than 1. Notice that the observed frequencies are not involved here, only the expected frequencies.

If any cell had a very small expected frequency it would contribute enormously to the value of $X^2$. For example, if we observe one subject in a cell with an expected frequency of 0.1, the contribution of that cell to $X^2$ would be $(1 - 0.1)^2/0.1 = 8.1$, enough to give a significant result in a $2 \times 2$ table regardless of the other frequencies.

If we have a table with too many small expected frequencies we should find some sensible way to combine some of the categories in the row and/or column variables. There is a special method for $2 \times 2$ tables with small frequencies (section 10.7.2).
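The guideline on expected frequencies is easy to check by program. In this Python sketch the function names are illustrative, the thresholds are those of the guideline described in this section, and the example values are the expected frequencies of Table 10.7.

```python
def cochran_check(expected):
    """Return (share of cells with expected > 5, smallest expected frequency)."""
    cells = [e for row in expected for e in row]
    share_above_5 = sum(e > 5 for e in cells) / len(cells)
    return share_above_5, min(cells)


def meets_guideline(expected):
    """True if at least 80% of cells exceed 5 and every cell exceeds 1."""
    share_above_5, smallest = cochran_check(expected)
    return share_above_5 >= 0.8 and smallest > 1


table_10_7 = [
    [705.8, 1488.0, 578.1, 257.1],
    [32.9, 69.3, 26.9, 12.0],
    [167.3, 352.7, 137.0, 60.9],
]
print(meets_guideline(table_10_7))

# Contribution to X^2 of one observation in a cell with expected frequency 0.1.
print((1 - 0.1) ** 2 / 0.1)
```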

10.6.9 Particular types of frequency table

The Chi squared test for the $r \times c$ frequency table has been discussed and illustrated in its most general form. There are two considerations that determine special types of table and lead to different analyses: firstly if the categories of one variable (or both) are ordered, and secondly if one variable (or both) has only two categories. In practice, large tables in which neither variable has ordered categories are rare. Indeed the caffeine data used to illustrate the method had one variable ordered, and I shall return to that data set later.

The importance of ordered categories was discussed in section 9.8, and the same argument applies to categorical variables. If we are analysing data for an ordinal variable we will usually wish to know if there is some trend across the ordered groups rather than just whether the groups differ. This more specific possibility allows for a more sensitive (powerful) statistical analysis.

The case when one variable has only two categories is important because the data can also be considered as proportions; the analyses turn out to be precisely equivalent to the methods for comparing proportions described in sections 10.3 and 10.5. Also, although we still use the same general formula of $\sum (O - E)^2/E$, it can be simplified for easier calculation. The simplest frequency table, the $2 \times 2$ table, turns out to have certain problems all of its own, especially for small samples. The various types of table to consider are listed in Table 10.9 with the numbers of the sections in which the analysis is described.

Table 10.9 Different types of frequency table, according to number of categories (2 or 3+) and whether categories are ordered

| Number of categories, variable 1 | Number of categories, variable 2 | Section of book |
|---|---|---|
| 2 | 2 | 10.7 |
| 2 | 3+ not ordered | 10.8.1 |
| 2 | 3+ ordered | 10.8.2 |
| 3+ not ordered | 3+ not ordered | 10.6.6 |
| 3+ ordered | 3+ not ordered | 10.9.1 |
| 3+ ordered | 3+ ordered | 10.9.2 |

The analysis of $2 \times 2$ tables is one of the most common in medical research, so I shall consider it first.

10.7 $2 \times 2$ FREQUENCY TABLES - COMPARISON OF TWO PROPORTIONS

The analysis of $2 \times 2$ tables follows the same basic method as used for larger tables, but there are some particular features to note. Table 10.10 shows data from a case-control study carried out among swimmers to investigate the possible association between exposure to chlorinated swimming pool water and erosion of dental enamel. Among 49 swimmers with enamel erosion (the cases) 32 reported swimming six or more hours per week, compared with 118 of 245 swimmers without enamel erosion (the controls). We can see that, although the data are displayed as a $2 \times 2$ frequency table, the comparison of the groups is in fact a comparison of two proportions. I shall show in section 10.7.4 that the Chi squared test is exactly equivalent to the hypothesis test for comparing two proportions given in section 10.3.

Table 10.10 Comparison of number of hours' swimming by swimmers with or without erosion of dental enamel (Centerwall et al., 1986)

| Amount of swimming per week | Erosion: Yes (cases) | Erosion: No (controls) | Total |
|---|---|---|---|
| ≥ 6 hours | 32 | 118 | 150 |
| < 6 hours | 17 | 127 | 144 |
| Total | 49 | 245 | 294 |

The null hypothesis is that enamel erosion is unrelated to amount of swimming (and hence exposure to chlorinated water). To perform a Chi squared test we need to calculate the expected frequencies if the null hypothesis is true. It will help in the calculations if we use $a$, $b$, $c$ and $d$ to denote the four observed frequencies, as in Table 10.11.

Table 10.11 General $2 \times 2$ frequency table

| | Column 1 | Column 2 | Total |
|---|---|---|---|
| Row 1 | $a$ | $b$ | $a + b$ |
| Row 2 | $c$ | $d$ | $c + d$ |
| Total | $a + c$ | $b + d$ | $N$ |

As we saw in section 10.6.2 the expected frequency in a cell is the product of the relevant row and column totals divided by the sample size. For the cell with observed frequency $a$, for example, the expected value is $(a + b)(a + c)/N$. For the data in Table 10.10 the expected frequencies and contributions to $X^2$ are shown in Table 10.12. The difference $O - E$ is the same, apart from its sign, for all four cells, and this is true for all $2 \times 2$ tables. This demonstrates that we have only one independent observation rather than four and so just one degree of freedom.

Table 10.12 Expected frequencies and contributions to $X^2$ for the data in Table 10.10

| Observed frequency ($O$) | Expected frequency ($E$) | $O - E$ | $(O - E)^2/E$ |
|---|---|---|---|
| $a = 32$ | $E(a) = 25$ | 7 | 1.960 |
| $b = 118$ | $E(b) = 125$ | -7 | 0.392 |
| $c = 17$ | $E(c) = 24$ | -7 | 2.042 |
| $d = 127$ | $E(d) = 120$ | 7 | 0.408 |
| Total 294 | 294 | 0 | $X^2 = 4.802$ |

For a $2 \times 2$ table the formula for $X^2$ can be simplified. The contribution from the first cell in the table to $X^2$ can be expressed as

$$\frac{\left(a - \frac{(a + b)(a + c)}{N}\right)^2}{\frac{(a + b)(a + c)}{N}} = \frac{(ad - bc)^2}{N(a + b)(a + c)}$$

and we can produce similar expressions for the other three cells. The sum of the four terms, after much tedious manipulation, can be turned into

$$X^2 = \frac{N(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}$$

This version of the formula for $X^2$ is often used for $2 \times 2$ tables, because it avoids the need to calculate the expected values explicitly. It is important to appreciate that this formula for $X^2$ from a $2 \times 2$ table is mathematically identical to the general formula $\sum (O - E)^2/E$.

For the data in Table 10.10 we get

$$X^2 = \frac{294 \times (32 \times 127 - 118 \times 17)^2}{150 \times 144 \times 49 \times 245} = 4.80$$

which agrees with Table 10.12. From Table B5 we get $P < 0.05$, suggesting that there is evidence in support of an association between amount of swimming and erosion of dental enamel.
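The equivalence of the general formula and the short cut formula for a $2 \times 2$ table can be confirmed numerically. This Python sketch uses the frequencies of Table 10.10; the function names are illustrative.

```python
def x2_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of a 2 x 2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def x2_shortcut(a, b, c, d):
    """Short cut formula N(ad - bc)^2 / product of the four marginal totals."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


swimming = (32, 118, 17, 127)   # a, b, c, d from Table 10.10
print(round(x2_general(*swimming), 3), round(x2_shortcut(*swimming), 3))
```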

The Chi squared test is a hypothesis test. It is an exactly equivalent test to the comparison of two proportions described in section 10.3.3, but no estimate of the difference between the groups (or a confidence interval) is obtained when the data are analysed in this way. The approach based on comparing proportions is therefore preferable. There is a third way of comparing proportions, which involves calculating the ratio of proportions in two groups rather than their difference. This approach is particularly suitable for case- control studies and is described in section 10.11.

10.7.1 Continuity correction

When the sample sizes are small the use of the continuous Chi squared distribution to approximate frequencies introduces some bias into the calculation, so that the value of $X^2$ tends to be a little too large. We use a continuity correction to remove the bias, in the same way as when comparing two proportions (section 10.3.3). In the context of $2 \times 2$ tables the correction is known as Yates' correction after the statistician who devised it.

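The correction consists of moving each $(O - E)$ nearer to zero by $\frac{1}{2}$. In other words we replace $(O - E)^2$ by $(|O - E| - \frac{1}{2})^2$. The short cut formula with Yates' correction becomes

$$X_c^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a + b)(c + d)(a + c)(b + d)}$$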

I recommend that this formula is used for all Chi squared tests on $2 \times 2$ tables, although for large samples the effect of the correction will be small.

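For the dental erosion data the use of the continuity correction gives

$$X_c^2 = \frac{294 \times (|32 \times 127 - 118 \times 17| - 147)^2}{150 \times 144 \times 49 \times 245} = 4.14$$

and we still have $P < 0.05$.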

For small samples, however, the difference between $X^2$ and $X_c^2$ is more marked. The data from the previously discussed trial of IRS versus placebo in patients with cervical osteoarthrosis will illustrate the effect; the results are shown as a frequency table in Table 10.13. The uncorrected Chi squared test gives

$$X^2 = \frac{25 \times (9 \times 9 - 4 \times 3)^2}{13 \times 12 \times 12 \times 13} = 4.89$$

($P < 0.05$), whereas the use of Yates' correction gives

$$X_c^2 = \frac{25 \times (|9 \times 9 - 4 \times 3| - 12.5)^2}{13 \times 12 \times 12 \times 13} = 3.28$$

($P > 0.05$).

Table 10.13 Results of a clinical trial comparing IRS versus placebo (Lewith and Machin, 1981)

| Improvement in pain | IRS | Placebo | Total |
|---|---|---|---|
| Yes | 9 | 4 | 13 |
| No | 3 | 9 | 12 |
| Total | 12 | 13 | 25 |

This example shows the advantage of giving more exact $P$ values, rather than imprecise ones obtained from a table. Many computer programs give the precise $P$ values for Chi squared tests, which for $X^2$ and $X_c^2$ in this example are $P = 0.03$ and $P = 0.07$ respectively. As discussed in section 8.5, we should not make a radical adjustment to our interpretation just because the $P$ value has moved the other side of 0.05, but the evidence of an association is weaker when we use the more appropriate version of the test with Yates' correction.
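The effect of the continuity correction on both examples can be reproduced with a short Python sketch. The function names are illustrative; the $P$ value uses the fact that a Chi squared variable with one degree of freedom is the square of a standard Normal variable, and the corrected difference is not allowed to fall below zero, which is an added safeguard rather than part of the formula in the text.

```python
import math


def x2_uncorrected(a, b, c, d):
    """Short cut formula for a 2 x 2 table without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def x2_yates(a, b, c, d):
    """Short cut formula for a 2 x 2 table with Yates' correction."""
    n = a + b + c + d
    shrunk = max(abs(a * d - b * c) - n / 2, 0)   # safeguard: never below zero
    return n * shrunk ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(x2):
    """Upper tail area of the Chi squared distribution with 1 df."""
    return math.erfc(math.sqrt(x2 / 2))


examples = {"dental erosion": (32, 118, 17, 127), "IRS trial": (9, 4, 3, 9)}
for name, cells in examples.items():
    plain, corrected = x2_uncorrected(*cells), x2_yates(*cells)
    print(f"{name}: X2 = {plain:.2f} (P = {p_value_1df(plain):.3f}), "
          f"Xc2 = {corrected:.2f} (P = {p_value_1df(corrected):.3f})")
```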

These results are exactly the same as when the proportions in the two groups were compared in section 10.3. As noted, the Chi squared method yields only a $P$ value, whereas the comparison of proportions also yields the difference in proportions and its confidence interval. The mathematical equivalence of the two methods is demonstrated in section 10.7.4. It follows that the Chi squared test is equivalent to a comparison of the proportions in the columns and also a comparison of the proportions in the rows.

10.7.2 Small samples - Fisher's exact test

The use of Yates' correction does not remove the requirement concerning the size of the expected frequencies. Using the earlier rule that 80% of cells should have expected values of at least 5 we would require all cells of a $2 \times 2$ table to have this property, although in practice this rule can be relaxed to allow one cell to have an expected value slightly lower than 5. Note that all the expected frequencies in Table 10.13 are greater than 5 even though two of the observed frequencies are less than 5.

There is an alternative approach for tables with very small expected frequencies, known as Fisher's exact test after the famous statistician R. A. Fisher. Although the method is different in principle from any other described in this chapter, it is also based on the observed row and column totals. The method consists of evaluating the probability associated with all possible \(2 \times 2\) tables which have the same row and column totals as the observed data, making the assumption that the null hypothesis is true. As before, the null hypothesis here is that the row and column variables are unrelated. Like the Chi squared test, the method is purely a hypothesis test.

表10.14显示了一项比较少年违法者和对照组健康状况的研究数据。对于每组,列出了有视力缺陷的男孩数量,以及他们是否佩戴眼镜。我们可以检验原假设,即两组人群中佩戴眼镜的比例相同;换言之,少年违法者与其他男孩在意识到视力不良的可能性相等。四个单元格中有三个的期望值低于5,因此不应使用卡方检验,而应使用费舍尔精确检验,该检验对样本大小没有限制。
Table 10.14 shows data from a study comparing the health of juvenile delinquent boys and a control group. For each group, the number of boys with vision defects is shown, together with the numbers who did or did not wear spectacles (glasses). We can test the null hypothesis that the proportions wearing glasses in the population are the same; that is, that juvenile delinquents are equally likely to be aware of poor eyesight as other boys. The expected numbers in three of the four cells are below 5, so we should not use a Chi squared test, but we can use Fisher's exact test for which there is no sample size restriction.

Table 10.14 Spectacle wearing among juvenile delinquents and non-delinquents who failed a vision test (Weindling et al., 1986)

| Spectacle wearers | Juvenile delinquents | Non-delinquents | Total |
|---|---|---|---|
| Yes | 1 | 5 | 6 |
| No | 8 | 2 | 10 |
| Total | 9 | 7 | 16 |

表10.15列出了所有可能的频数组合,这些组合的行列总数与观察到的数据相同,其中一组(表(ii))对应观察数据。对于每个表格,我们可以计算在原假设成立时出现该数据的概率。然后利用这些概率计算在原假设成立时获得观察数据或更不可能结果的总体概率。
Table 10.15 shows all the possible sets of frequencies which add up to the observed row and column totals, one of which (table (ii)) corresponds to the observed data. For each table we can calculate the probability of such data arising if the null hypothesis is true. We then use these probabilities to calculate the overall probability of getting the observed data, or a less likely result, when the null hypothesis is true.

计算每个概率的数学公式较为复杂,因此最好用计算机完成。遗憾的是,统计软件包通常不包含费舍尔精确检验,因此相关计算将在第10.7.3节中详细说明。
The mathematical formula to calculate each probability is rather complicated, so the calculation is much better done by a computer. Unfortunately, statistical packages do not include Fisher's exact test, so the calculations are described in section 10.7.3.

表10.16显示了表10.15中所有七组频数对应的概率。得到组间差异至少与观察到的差异一样大的总体概率
Table 10.16 shows the probabilities associated with all seven sets of frequencies shown in Table 10.15. The overall probability of obtaining a difference between the groups at least as large as the observed difference

Table 10.15 All tables of frequencies which have the same row and column totals as Table 10.14

[The seven tables, (i) to (vii), are not reproduced here; their cell frequencies a, b, c and d are listed in Table 10.16.]

Table 10.16 Probability associated with each set of frequencies in Table 10.15

| Table | a | b | c | d | P |
|---|---|---|---|---|---|
| (i) | 0 | 6 | 9 | 1 | 0.00087 |
| (ii) | 1 | 5 | 8 | 2 | 0.02360 |
| (iii) | 2 | 4 | 7 | 3 | 0.15734 |
| (iv) | 3 | 3 | 6 | 4 | 0.36713 |
| (v) | 4 | 2 | 5 | 5 | 0.33042 |
| (vi) | 5 | 1 | 4 | 6 | 0.11014 |
| (vii) | 6 | 0 | 3 | 7 | 0.01049 |
| Total | | | | | 0.99999 |

when the null hypothesis is true can be calculated in two ways. The first is to evaluate the probabilities in the 'tail' of the distribution in which the observed data fall and then double this value to get a two-tailed test. From Table 10.16 we use the probabilities for tables (i) and (ii) to get \(P = 2 \times (0.00087 + 0.02360) = 0.049\). Alternatively, we can add up the probabilities of all tables that have probabilities less than or equal to that corresponding to the observed data. For the example we use the probabilities for tables (i), (ii) and (vii) to get \(P = 0.00087 + 0.02360 + 0.01049 = 0.035\). I feel that the second approach is more reasonable, but many statisticians recommend doubling the P value obtained for one tail. In most cases the difference will not be marked (but occasionally it can be). The second approach will always give a P value less than or equal to that obtained by the first method. In this example we can conclude that there is some evidence that juvenile delinquents are less aware of eyesight problems than non-delinquents.
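The two ways of obtaining a two-sided P value can be checked with a short program. The following is a minimal sketch rather than a reference implementation: the function names are illustrative, the tail used for doubling is taken to be the smaller of the two cumulative tails, and the table probabilities are computed from binomial coefficients, which is algebraically the same as the factorial formula of section 10.7.3.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one 2 x 2 table (first row a, b; second row c, d)
    under the null hypothesis with all row and column totals fixed."""
    n = a + b + c + d
    # Same value as (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!).
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def fisher_exact(a, b, c, d):
    """Return the probability of every table sharing the observed margins
    (keyed by the top-left frequency) and the two two-sided P values."""
    row1, col1, col2 = a + b, a + c, b + d
    probs = {}
    for x in range(max(0, row1 - col2), min(row1, col1) + 1):
        probs[x] = table_probability(x, row1 - x, col1 - x, col2 - row1 + x)
    p_observed = probs[a]
    lower_tail = sum(p for x, p in probs.items() if x <= a)
    upper_tail = sum(p for x, p in probs.items() if x >= a)
    # First approach: double the tail in which the observed table falls
    # (assumed here to be the smaller cumulative tail).
    p_doubled = min(1.0, 2 * min(lower_tail, upper_tail))
    # Second approach: add up every table that is no more probable than the
    # observed one (the tolerance guards against floating-point ties).
    p_summed = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return probs, p_doubled, p_summed


# Table 10.14: a = 1, b = 5 (spectacle wearers); c = 8, d = 2 (non-wearers).
probs, p_doubled, p_summed = fisher_exact(1, 5, 8, 2)
for x, p in probs.items():
    print(x, round(p, 5))
print(round(p_doubled, 3), round(p_summed, 3))  # 0.049 0.035
```

The seven printed probabilities correspond to tables (i) to (vii) of Table 10.16, indexed by the frequency a.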

Lastly, it should be noted that Fisher's exact test usually gives a P value that is much the same as that from a Chi squared test with Yates' correction, even when the expected frequencies are too small for the latter approach, suggesting that the rule relating to expected frequencies is probably too restrictive. Fisher's exact test is purely a hypothesis test: there is no equivalent method of estimation for comparing proportions from very small samples.

10.7.3 费舍尔精确检验—数学原理与实例解析 10.7.3 Fisher's exact test - mathematics and worked example

(本节内容较为理论化,尽管不涉及高深的数学知识,仍可跳过而不影响整体连贯性。)
(This section is more theoretical although not highly mathematical. It can be omitted without loss of continuity.)

The probability of obtaining the cell frequencies \(a\), \(b\), \(c\) and \(d\) when the null hypothesis is true and the row and column totals are fixed is given by

\[ \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!} \]

where the symbol \(n!\), called '\(n\) factorial', means that we multiply together all the integers from 1 up to \(n\) (see Appendix A). For example, \(5! = 5 \times 4 \times 3 \times 2 \times 1 = 120\). (Note that we need to define \(0! = 1\).) This peculiar formula is derived from calculating the number of different ways (combinations) in which the \(N\) individuals can be arranged in a \(2 \times 2\) table to give the observed row and column totals. Table 10.15 shows the seven such tables for the eyesight data of Table 10.14.

For the first possibility (i) we have \(a = 0\), \(b = 6\), \(c = 9\) and \(d = 1\), so that the probability of this table arising by chance when the null hypothesis is true is

\[ \frac{6!\,10!\,9!\,7!}{16!\,0!\,6!\,9!\,1!} \]

Evaluating this formula is tedious. In this example there are some 70 numbers in the calculation, and multiplying together all the numbers in the top row first may exceed the storage capability of a calculator or computer. However, the calculation can usually be simplified by cancelling out sequences that appear on the top and bottom of the formula. Here 6! and 9! can be deleted immediately, and we can omit 0! and 1! as they are both equal to 1, so that the probability reduces to

\[ \frac{10!\,7!}{16!} = 0.00087 \]

For table (ii), which corresponds to the observed data, we get a probability of

\[ \frac{6!\,10!\,9!\,7!}{16!\,1!\,5!\,8!\,2!} \]

We can simplify this expression by noting that \(6!/5! = 6\), \(9!/8! = 9\), and so on, to get

\[ \frac{6 \times 9 \times 7!}{2 \times 16 \times 15 \times 14 \times 13 \times 12 \times 11} = 0.02360 \]

为了执行 Fisher 精确检验,我们对所有表格进行相同的计算,如表 10.16 所示。我们本可以只计算那些对概率分布尾部有贡献的表格的概率,但事先识别这些表格并不容易。计算机程序的优势在此显而易见。
To perform Fisher's exact test we carry out the same calculation for all tables, as shown in Table 10.16. We could just calculate the probability for those tables which contribute to the tail(s) of the distribution of probabilities, but it is not easy to identify these in advance. The benefit of a computer program is clearly seen.
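One common way for a program to avoid the storage problem mentioned above is to add and subtract logarithms of factorials instead of multiplying the factorials themselves. The sketch below assumes this approach (it is not described in the text) and uses Python's `math.lgamma`, for which `lgamma(n + 1)` equals the logarithm of n!; the function names are illustrative.

```python
from math import exp, factorial, lgamma


def log_factorial(n):
    # lgamma(n + 1) is log(n!), which stays small even when n! is enormous.
    return lgamma(n + 1)


def table_probability(a, b, c, d):
    """Probability of a 2 x 2 table with fixed margins, evaluated on the
    log scale so that no intermediate product can overflow."""
    n = a + b + c + d
    log_p = (log_factorial(a + b) + log_factorial(c + d)
             + log_factorial(a + c) + log_factorial(b + d)
             - log_factorial(n)
             - log_factorial(a) - log_factorial(b)
             - log_factorial(c) - log_factorial(d))
    return exp(log_p)


p_i = table_probability(0, 6, 9, 1)    # table (i)
p_ii = table_probability(1, 5, 8, 2)   # table (ii), the observed data

# The hand simplification for table (i): 10! 7! / 16!
p_i_by_cancelling = factorial(10) * factorial(7) / factorial(16)

print(round(p_i, 5), round(p_ii, 5))   # 0.00087 0.0236
```

For tables as small as these, direct multiplication would also work in Python because its integers are unbounded; the log scale matters mainly for large samples or for languages with fixed-width numbers.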

10.7.4 比例比较与卡方检验的等价性 10.7.4 Equivalence of the comparison of proportions and the Chi squared test

(本节较为理论,虽然数学不复杂,可在不影响连贯性的情况下略过。)
(This section is more theoretical although not highly mathematical. It can be omitted without loss of continuity.)

I have commented more than once that the method for comparing two independent proportions is identical to the Chi squared test for a \(2 \times 2\) table. This can be shown mathematically, by expressing the comparison of two proportions in the notation of Table 10.11. We have \(p_1 = a/(a+c)\), \(p_2 = b/(b+d)\), and the pooled proportion is \(p = (a+b)/N\), so that the value of \(z\) for comparing the two observed proportions is

\[ z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\dfrac{1}{a+c} + \dfrac{1}{b+d}\right)}} \]

which, after some manipulation, gives

\[ z = \frac{(ad - bc)\sqrt{N}}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]

The value of \(z\) is thus the square root of the value of \(X^2\), and the two tests are equivalent because, as noted in section 10.6.3, the Chi squared distribution with one degree of freedom is the square of the standard Normal distribution.
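The equivalence is easy to confirm numerically. The sketch below assumes that the two groups are the columns of the table (so the group sizes are a + c and b + d) and that no continuity correction is applied to either statistic; the cell counts are hypothetical and chosen only for illustration.

```python
from math import sqrt


def z_two_proportions(a, b, c, d):
    """z statistic for comparing the column proportions a/(a+c) and b/(b+d),
    using the pooled proportion in the standard error."""
    n1, n2 = a + c, b + d
    p1, p2 = a / n1, b / n2
    p = (a + b) / (n1 + n2)
    return (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def chi_squared_2x2(a, b, c, d):
    """Uncorrected Chi squared statistic for a 2 x 2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts, not taken from the text.
a, b, c, d = 15, 30, 85, 70
z = z_two_proportions(a, b, c, d)
x2 = chi_squared_2x2(a, b, c, d)
print(z * z, x2)  # the two values agree up to rounding error
```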

10.7.5 \(2 \times 2\) tables - paired samples

Paired proportions may also be shown as a \(2 \times 2\) table. For example, the data in Table 10.3 can be rearranged as in Table 10.17. Although the table closely resembles those relating to the comparison of two independent proportions, such as Tables 10.10, 10.13 and 10.14, it is essential to remember that the proportions are paired and so the usual Chi squared test is inappropriate.

Table 10.17 Results of Table 10.3 rearranged as a \(2 \times 2\) table, showing numbers with (+) or without (-) sleeping difficulties among marijuana users and matched controls

| Control group | Marijuana group + | Marijuana group - | Total |
|---|---|---|---|
| + | 4 | 9 | 13 |
| - | 3 | 16 | 19 |
| Total | 7 | 25 | 32 |

The comparison of paired proportions is based on the frequencies of pairs with different outcomes, as we saw in section 10.4 where the confidence interval and hypothesis test were described. Writing \(r\) and \(s\) for these two frequencies (9 and 3 in Table 10.17), the test statistic incorporating the continuity correction was given in section 10.4.3 as

\[ z = \frac{|r - s| - 1}{\sqrt{r + s}} \]

Sometimes the test statistic is calculated slightly differently as

\[ X^2 = \frac{(|r - s| - 1)^2}{r + s} \]

which is clearly equal to \(z^2\). The value of \(X^2\) is referred to the Chi squared distribution with one degree of freedom. As we have seen for independent proportions these two tests are exactly equivalent.

The test of paired proportions is often known as McNemar's test, especially when the data are shown as a \(2 \times 2\) table.
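As an illustration, the paired test can be applied to the discordant pairs of Table 10.17 (9 pairs in which only the control had sleeping difficulties and 3 in which only the marijuana user did). This is a minimal sketch, assuming the continuity-corrected form of the statistic based on the two discordant frequencies; the names `r`, `s` and `mcnemar` are illustrative.

```python
from math import erfc, sqrt


def mcnemar(r, s):
    """Continuity-corrected test of paired proportions from the two
    discordant frequencies r and s. Returns (z, X2, two-sided P)."""
    diff = max(abs(r - s) - 1, 0)   # never let the correction go below zero
    z = diff / sqrt(r + s)
    x2 = diff ** 2 / (r + s)
    # Two-sided P from the standard Normal distribution; the same value is
    # obtained by referring X2 to Chi squared on one degree of freedom.
    p = erfc(z / sqrt(2))
    return z, x2, p


z, x2, p = mcnemar(9, 3)
print(round(z, 2), round(x2, 2), round(p, 2))  # 1.44 2.08 0.15
```

Only the 12 discordant pairs enter the calculation; the 20 pairs with the same outcome in both members carry no information about the difference.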

10.8 \(2 \times k\) TABLES - COMPARISON OF SEVERAL PROPORTIONS

As indicated earlier, the statistical comparison of proportions derived from more than two groups differs according to whether the categories defining the groups are ordered or not.

Discussing \(2 \times k\) tables as a special case makes it rather easier to discuss the problems of multiple comparisons, and to consider the special situation of ordered groups. Also there is a 'short-cut' formula available for hand calculations.

10.8.1 无序类别 10.8.1 Unordered categories

表10.18显示了四类办公室工作人员报告的眼睛疲劳情况。数据来源于一项评估使用视觉显示单元(VDU,即电脑显示器)可能带来有害影响的研究。原假设是四组报告眼睛疲劳的比例无差异。
Table 10.18 shows reported eye strain for four types of office workers. The data are from a study carried out to assess possible harmful effects of using visual display units (VDUs) (i.e. computer monitors). The null hypothesis is that there is no difference in the proportions reporting eye strain in the four groups.

Table 10.18 Eye strain reported by four groups of office workers (Reading and Weale, 1986)

| Type of work | Number in sample | Number with eye strain | Proportion with eye strain |
|---|---|---|---|
| Data entry in VDUs | 53 | 11 | 0.208 |
| Conversational use of VDUs | 109 | 30 | 0.275 |
| Full-time typing | 78 | 14 | 0.179 |
| Traditional office work (clerical) | 55 | 3 | 0.055 |
| Total | 295 | 58 | 0.197 |

Analysis of proportions from unordered categories can be based on calculation of \(X^2\) according to the general formula previously given: \(X^2 = \sum (O - E)^2 / E\). An alternative formulation is as follows. If there are \(n_i\) subjects in group \(i\), of whom \(r_i\) have the characteristic of interest, we can calculate \(X^2\) as

\[ X^2 = \frac{\sum r_i^2 / n_i - R^2 / N}{P(1 - P)} \]

where \(R\) is the total number with the characteristic (\(R = \sum r_i\)), \(N\) is the total sample size, and \(P = R/N\). We compare \(X^2\) to the Chi squared distribution with \(k - 1\) degrees of freedom. For the data in Table 10.18, we have

\[ X^2 = \frac{11^2/53 + 30^2/109 + 14^2/78 + 3^2/55 - 58^2/295}{0.1966 \times 0.8034} = 11.48 \]

which, from the table of Chi squared with 3 degrees of freedom (Table B5), corresponds to \(P < 0.01\) (the exact value is \(P = 0.0094\)). There is thus strong evidence that eye strain is not equally common in all four groups.
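The short-cut calculation for the eye strain data can be reproduced as follows. This is a sketch under stated assumptions: the statistic is computed as the sum of \(r_i^2/n_i\) minus \(R^2/N\), divided by \(P(1-P)\), and the P value uses a closed-form expression for the upper tail of the Chi squared distribution that holds only for 3 degrees of freedom. The function names are illustrative.

```python
from math import erfc, exp, pi, sqrt


def chi_squared_2xk(r, n):
    """Chi squared statistic for comparing k proportions r[i]/n[i].
    Returns (X2, degrees of freedom)."""
    R, N = sum(r), sum(n)
    P = R / N
    x2 = (sum(ri * ri / ni for ri, ni in zip(r, n)) - R * R / N) / (P * (1 - P))
    return x2, len(r) - 1


def chi2_upper_tail_3df(x):
    """Upper tail probability of Chi squared with exactly 3 degrees of
    freedom (closed form; not valid for other degrees of freedom)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


# Table 10.18: numbers with eye strain and numbers in each sample.
r = [11, 30, 14, 3]
n = [53, 109, 78, 55]
x2, df = chi_squared_2xk(r, n)
p = chi2_upper_tail_3df(x2)
print(round(x2, 2), df, round(p, 4))  # 11.48 3 0.0094
```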

Interpretation of this highly significant variation among the groups depends upon isolating which groups differ from the others. In the absence of any prior hypothesis, comparison of each pair of groups requires six further tests, and the risk of a false positive result is high unless we adjust the P values. A better approach is often to combine, or 'collapse', some groups. As this study was carried out to examine the possible adverse health effect of using VDUs, the two VDU groups can reasonably be combined and compared with each of the other two groups. Of the subjects in the two VDU groups combined, 41/162 (0.253) had eye strain.

The results of the three pairwise comparisons (\(2 \times 2\) tables with Yates' correction) are as follows:

| Comparison | \(X^2\) | P |
|---|---|---|
| VDU v Typing | 1.22 | 0.27 |
| VDU v Traditional | 8.82 | 0.003 |
| Typing v Traditional | 3.47 | 0.06 |

It seems, therefore, that both typing and using a VDU, especially the latter, are associated with more eye strain than traditional clerical office work. We probably should multiply the P values by three (the Bonferroni correction) to allow for the multiple comparisons. Either way there is no evidence from this study to suggest that use of a VDU is associated with more eye strain than typing.
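The three comparisons and the Bonferroni adjustment can be reproduced with the sketch below. It assumes the usual Yates-corrected statistic for a \(2 \times 2\) table, in which N/2 is subtracted from |ad - bc| before squaring, and it caps each adjusted P value at 1; the function and variable names are illustrative.

```python
from math import erfc, sqrt


def yates_chi_squared(r1, n1, r2, n2):
    """Yates-corrected Chi squared for comparing r1/n1 with r2/n2."""
    a, b = r1, r2                  # numbers with eye strain
    c, d = n1 - r1, n2 - r2        # numbers without eye strain
    n = n1 + n2
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(x2):
    """Upper tail of the Chi squared distribution on 1 degree of freedom."""
    return erfc(sqrt(x2 / 2))


# (number with eye strain, number in sample); the two VDU groups are combined.
groups = {"VDU": (41, 162), "Typing": (14, 78), "Traditional": (3, 55)}
pairs = [("VDU", "Typing"), ("VDU", "Traditional"), ("Typing", "Traditional")]

results = {}
for first, second in pairs:
    x2 = yates_chi_squared(*groups[first], *groups[second])
    p = p_value_1df(x2)
    results[(first, second)] = (x2, p, min(1.0, len(pairs) * p))

for (first, second), (x2, p, p_adj) in results.items():
    print(f"{first} v {second}: X2 = {x2:.2f}, P = {p:.3f}, adjusted P = {p_adj:.3f}")
```

After adjustment only the comparison of VDU use with traditional office work remains below 0.05, which matches the conclusion that there is no evidence of a difference between VDU use and typing.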

10.8.2 有序类别 10.8.2 Ordered categories

当我们希望比较具有顺序关系的组间频率或比例时,应利用这种顺序来提高统计分析的效能。上一节描述的方法评估了观察数据与“各组相同”的原假设的偏离,但并未考虑任何特定的顺序。当组是有序时,我们通常期望组间的差异与该顺序相关。忽视组的顺序是一个常见的统计错误(Moses 等,1984)。下面介绍两种主要的分析方法。
When we wish to compare frequencies or proportions among groups which have an ordering, we should make use of the ordering to increase the power of the statistical analysis. The method described in the previous section assesses departure of the observed data from the null hypothesis that the groups are the same, but in no particular manner. When the groups are ordered we usually expect any differences among the groups to be related to the ordering. Failure to take account of the ordering of groups is a common statistical error (Moses et al., 1984). Two main possible analyses are described below.

(a) 趋势卡方检验 (a) Chi squared test for trend

We can subdivide variation among groups into that due to a trend in proportions across the groups and the remainder. Although the value of \(X^2\) for trend will always be less than \(X^2\) for the overall comparison, the Chi squared test for trend is a powerful method of analysis because it yields a test statistic from a Chi squared distribution with one degree of freedom rather than \(k - 1\) degrees of freedom for the usual Chi squared test. If most of the variation is due to a trend across the groups, then the test for trend will yield a much smaller P value.

The test for trend will be illustrated using the data in Table 10.19 (already shown as Table 10.1) relating the frequency of babies delivered by Caesarean section to maternal shoe size. The rationale for this study was that small shoe size is a simple indicator of possible birth difficulty due to a small pelvis. The data for larger shoe sizes have been amalgamated to give adequate expected numbers in all cells. The standard Chi squared test of this \(2 \times 6\) table gives \(X^2 = 9.29\) with 5 degrees of freedom, for which \(P = 0.10\).

Table 10.19 Relation between frequency of Caesarean section and maternal shoe size (Frame et al., 1985)

| Caesarean section | Shoe size <4 | 4 | 4½ | 5 | 5½ | 6+ | Total |
|---|---|---|---|---|---|---|---|
| Yes | 5 | 7 | 6 | 7 | 8 | 10 | 43 |
| No | 17 | 28 | 36 | 41 | 46 | 140 | 308 |
| Total | 22 | 35 | 42 | 48 | 54 | 150 | 351 |


图10.3 不同鞋码组中剖宫产比例
Figure 10.3 Proportions of women having a baby by Caesarean section in different shoe size groups.

Table 10.19 shows the numbers of women having a baby by Caesarean section in each shoe size group, from which we can obtain the proportions in each group, shown graphically in Figure 10.3. The method for evaluating a trend is effectively to fit a straight line to the proportions, and see if the slope of the line is significantly different from zero (which represents a horizontal line). We need to take account of the fact that each proportion is based on different numbers of women. The method for fitting such a line is called regression analysis, and is not described until section 11.3, but we can obtain the same result by a calculation based on the observed frequencies. From this analysis we get a value of the test statistic \(X^2_{\text{trend}}\) on one degree of freedom. The calculations are described later.

In order to carry out this test we have to assign scores to each group. If the variable has a clear quantitative interpretation we can derive the scores from the definition of the groups. For example, the shoe size data can be scored 3.5, 4.0, 4.5, 5.0, 5.5 and 6.0 (or, equivalently, 1, 2, 3, 4, 5 and 6). The null hypothesis is now that there is no trend across groups. If the scores are equally spaced we refer to an observed trend as a linear trend.

Analysis of the Caesarean section data gives \(X^2_{\text{trend}} = 8.02\) on 1 degree of freedom (\(P = 0.005\)). There is thus strong evidence of a linear trend in the proportion of women giving birth by Caesarean section in relation to shoe size. This relation is not directly causal, of course, and no such interpretation should be made. Shoe size is here a convenient indicator of small pelvic size.

The overall value of \(X^2\) was 9.29 on 5 degrees of freedom. We can subtract the value of \(X^2_{\text{trend}}\) (8.02) to get a Chi squared test of the null hypothesis of no variation other than that due to trend. Here we get \(X^2 = 1.27\) on 4 degrees of freedom, which is nowhere near statistical significance. We can conclude that all the observed variation between the groups can be attributed to a linear trend.

注意,尽管线性趋势高度显著,如果我们试图用鞋码预测哪些女性需要剖宫产,大多数情况下是错误的。这类问题将在第14.4节讨论。
Note that, although the linear trend is highly significant, if we tried to use shoe size to predict which women would require a Caesarean section we would be wrong most of the time. This type of problem is considered in section 14.4.

(b) 方法及示例 (b) Method and worked example

The simplest way to calculate \(X^2_{\text{trend}}\) is by means of a formula that disguises the nature of the method. Fleiss (1981, p. 144) shows the derivation of the formula using the regression approach (see also section 11.15.2).

For group \(i\) we will call the observed frequency with a characteristic \(r_i\) and the total number of individuals \(n_i\). Further, we let \(x_i\) be the score allocated to group \(i\). Then we define some simplifying quantities as follows:

\[ R = \sum r_i, \quad N = \sum n_i, \quad P = R/N, \quad \bar{x} = \sum n_i x_i / N \]

The test statistic \(X^2_{\text{trend}}\) is then obtained as

\[ X^2_{\text{trend}} = \frac{\left(\sum r_i x_i - R\bar{x}\right)^2}{P(1-P)\left(\sum n_i x_i^2 - N\bar{x}^2\right)} \]

Table 10.20 Calculation of \(X^2_{\text{trend}}\) for the data in Table 10.19

| | Shoe size <4 | 4 | 4½ | 5 | 5½ | 6+ | Total |
|---|---|---|---|---|---|---|---|
| Caesarean section (\(r_i\)) | 5 | 7 | 6 | 7 | 8 | 10 | 43 (= R) |
| Total (\(n_i\)) | 22 | 35 | 42 | 48 | 54 | 150 | 351 (= N) |
| Score (\(x_i\)) | 1 | 2 | 3 | 4 | 5 | 6 | |
| \(r_i x_i\) | 5 | 14 | 18 | 28 | 40 | 60 | 165 |
| \(n_i x_i\) | 22 | 70 | 126 | 192 | 270 | 900 | 1580 |
| \(n_i x_i^2\) | 22 | 140 | 378 | 768 | 1350 | 5400 | 8058 |

Table 10.20 shows the basic calculation for the Caesarean section data. From these elements we get

\[ X^2_{\text{trend}} = \frac{(165 - 43 \times 1580/351)^2}{(43/351)(308/351)(8058 - 1580^2/351)} = \frac{815.7}{0.1075 \times 945.7} = 8.02 \]
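As a cross-check on the hand calculation, the sums in Table 10.20 and the resulting statistics can be reproduced in a few lines. This is a minimal sketch, assuming the score-based form of the trend statistic in which the squared difference between \(\sum r_i x_i\) and \(R\bar{x}\) is divided by \(P(1-P)(\sum n_i x_i^2 - N\bar{x}^2)\), with \(\bar{x} = \sum n_i x_i / N\) and \(P = R/N\); the function names are illustrative.

```python
def chi_squared_overall(r, n):
    """Usual Chi squared statistic for k proportions (k - 1 degrees of freedom)."""
    R, N = sum(r), sum(n)
    P = R / N
    return (sum(ri * ri / ni for ri, ni in zip(r, n)) - R * R / N) / (P * (1 - P))


def chi_squared_trend(r, n, x):
    """Chi squared statistic for linear trend (1 degree of freedom)."""
    R, N = sum(r), sum(n)
    P = R / N
    x_bar = sum(ni * xi for ni, xi in zip(n, x)) / N
    numerator = (sum(ri * xi for ri, xi in zip(r, x)) - R * x_bar) ** 2
    denominator = P * (1 - P) * (sum(ni * xi * xi for ni, xi in zip(n, x)) - N * x_bar ** 2)
    return numerator / denominator


# Table 10.19: Caesarean sections and total women in each shoe size group.
r = [5, 7, 6, 7, 8, 10]
n = [22, 35, 42, 48, 54, 150]
scores = [1, 2, 3, 4, 5, 6]

overall = chi_squared_overall(r, n)
trend = chi_squared_trend(r, n, scores)
# Unrounded remainder; the 1.27 quoted in the text is the difference of the
# rounded values 9.29 and 8.02.
remainder = overall - trend
print(round(overall, 2), round(trend, 2), round(remainder, 2))

# Scoring by the shoe sizes themselves gives the same statistic, because the
# test is unchanged by any linear rescaling of equally spaced scores.
trend_shoe_sizes = chi_squared_trend(r, n, [3.5, 4.0, 4.5, 5.0, 5.5, 6.0])
```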

(c) 有序类别组 (c) Qualitatively ordered groups

We often have data from groups which are clearly ordered, but where there is either no underlying scale of measurement or such a scale cannot be quantified. Examples of these two types of variable are social class and pain recorded as 'mild', 'moderate' or 'severe'. In the absence of any indication to the contrary it is generally reasonable to give such groups equally spaced scores and evaluate \(X^2_{\text{trend}}\) as if for a linear trend.

Sometimes, however, it is felt that a different spacing of scores is appropriate. For example, Norton and Dunn (1985) carried out a survey in which they related frequency of snoring to various medical conditions. Subjects were categorized as non-snorers, occasional snorers, those who snored nearly every night, and those who snored every night, on the basis of their spouses' reports. Table 10.21 shows data relating snoring to heart disease. The authors performed a Chi squared test for trend using scores of 1, 3, 5 and 6 for the four snoring groups. The overall comparison of the groups gives \(X^2 = 72.78\) on 3 degrees of freedom while the trend test gives \(X^2_{\text{trend}} = 72.69\) on 1 degree of freedom. Both of these are very highly significant. It is clear that all of the differences between the groups can be attributed to the trend. That is, there is a strong association between frequency of snoring and prevalence of heart disease.

The scores used in this study were not very different from equal spacing; given the descriptions of the groups perhaps 1, 2, 5 and 6 would have been more reasonable. In practice small differences in scoring are unlikely to have much effect on the test statistic. Of course, the scores should not be decided on the basis of the data but on prior considerations.

Table 10.21 Snoring behaviour in relation to presence or absence of heart disease (Norton and Dunn, 1985)

| Heart disease | Non-snorers | Occasional snorers | Snore nearly every night | Snore every night | Total |
|---|---|---|---|---|---|
| Yes | 24 (1.7%) | 35 (5.5%) | 21 (9.9%) | 30 (11.8%) | 110 (4.4%) |
| No | 1355 | 603 | 192 | 224 | 2374 |
| Total | 1379 | 638 | 213 | 254 | 2484 |
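The remark that small differences in scoring have little effect can be checked directly on the snoring data. The sketch below recomputes the trend statistic for three sets of scores: those used by the authors, the alternative suggested above, and equal spacing. It assumes the same score-based formula for the trend statistic as in the worked example; the function name is illustrative.

```python
def chi_squared_trend(r, n, x):
    """Chi squared statistic for linear trend (1 degree of freedom)."""
    R, N = sum(r), sum(n)
    P = R / N
    x_bar = sum(ni * xi for ni, xi in zip(n, x)) / N
    numerator = (sum(ri * xi for ri, xi in zip(r, x)) - R * x_bar) ** 2
    denominator = P * (1 - P) * (sum(ni * xi * xi for ni, xi in zip(n, x)) - N * x_bar ** 2)
    return numerator / denominator


# Table 10.21: numbers with heart disease and totals in the four snoring groups.
r = [24, 35, 21, 30]
n = [1379, 638, 213, 254]

score_sets = {
    "authors (1, 3, 5, 6)": (1, 3, 5, 6),
    "alternative (1, 2, 5, 6)": (1, 2, 5, 6),
    "equal spacing (1, 2, 3, 4)": (1, 2, 3, 4),
}
trend_by_scores = {label: chi_squared_trend(r, n, x) for label, x in score_sets.items()}
for label, value in trend_by_scores.items():
    print(f"{label}: {value:.2f}")
```

All three values lie between about 69 and 73, so the conclusion does not depend on which of these scorings is used.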

(d)替代方法—Mann-Whitney检验 (d) Alternative approach - the Mann-Whitney test

对有序组的频数数据的另一种处理方法是将数据视为两个有序等级的样本观测值。例如,在表10.19中,两个样本分别是接受剖宫产的女性和未接受剖宫产的女性。我们可以给有序组赋予等级1、2、3……,然后使用Mann-Whitney检验(第9.6.4节中描述)比较有无该特征的受试者的等级。当然,这类数据中存在大量并列等级,因为不同的取值较少,因此必须使用带有并列等级校正的检验版本。许多统计软件包都能执行此检验。
A different approach to frequency data from ordered groups is to treat the data as two samples of observations on an ordinal scale. For example, in Table 10.19 the two samples are women who had a Caesarean section and those who did not. We can give ranks 1, 2, 3, …, etc. to the ordered groups, and then compare the ranks for the subjects with or without the characteristic of interest using the Mann- Whitney test (described in section 9.6.4). There are, of course, vast numbers of tied ranks in data of this type because there are few different values, so it is essential to use the version of the test with a correction for ties. Many statistical packages can perform this test.

In general the Mann-Whitney test gives a very similar answer to the Chi squared test for trend. For example, for the Caesarean section data of Table 10.19 we get \(P = 0.004\) compared with \(P = 0.005\) from the Chi squared test for trend.
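For grouped data of this kind the Mann-Whitney test can be computed directly from the frequencies, because every subject in a group receives the same mid-rank. The following is a minimal sketch, assuming the large-sample Normal approximation with the standard correction of the variance for ties and no continuity correction; the function name is illustrative and the section 9.6.4 presentation may differ in detail.

```python
from math import erfc, sqrt


def mann_whitney_grouped(with_feature, without_feature):
    """Tie-corrected Mann-Whitney test for two samples recorded as
    frequencies over the same ordered groups. Returns (z, two-sided P)."""
    totals = [y + m for y, m in zip(with_feature, without_feature)]
    n1, n2 = sum(with_feature), sum(without_feature)
    N = n1 + n2

    # Mid-rank shared by all subjects in each ordered group.
    midranks, ranked_so_far = [], 0
    for t in totals:
        midranks.append(ranked_so_far + (t + 1) / 2)
        ranked_so_far += t

    rank_sum = sum(y * m for y, m in zip(with_feature, midranks))
    expected = n1 * (N + 1) / 2
    # Variance of the rank sum, reduced to allow for the tied ranks.
    ties = sum(t ** 3 - t for t in totals)
    variance = n1 * n2 / 12 * ((N + 1) - ties / (N * (N - 1)))

    z = (rank_sum - expected) / sqrt(variance)
    return z, erfc(abs(z) / sqrt(2))


# Table 10.19: women with and without a Caesarean section by shoe size group.
z, p = mann_whitney_grouped([5, 7, 6, 7, 8, 10], [17, 28, 36, 41, 46, 140])
print(round(z, 2), round(p, 3))  # -2.91 0.004
```

The negative sign of z indicates that the women who had a Caesarean section tended to fall in the lower shoe size groups.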

10.9 含有有序类别的大型表格 10.9 LARGE TABLES WITH ORDERED CATEGORIES

We should always take account of ordering in analysis of \(2 \times k\) tables, and we should do likewise for larger tables. There are two cases to consider: where either the row or column variable is ordered and where both are ordered.

10.9.1 一个有序变量 10.9.1 One ordered variable

With three or more groups of subjects classified by an ordinal variable we can use the Kruskal-Wallis test (section 9.8.6) to compare the groups. If the groups differ significantly we can use the Mann-Whitney test to compare pairs of groups. It is essential to use the versions of the tests that adjust for tied ranks.

Large frequency tables in which neither variable is ordered are rare, which is why the ordered caffeine data (Table 10.5) were used to illustrate the general Chi squared analysis of an \(r \times c\) table.

10.9.2 两个有序变量 10.9.2 Two ordered variables

分析两个有序变量关系的最简单方法是计算它们之间的等级相关系数。此方法将在第11.7.2节中介绍。然而,当我们希望比较一个样本(或配对样本)中两个或多个配对的有序变量时,适用的分析方法是配对Wilcoxon检验,该检验在第9.7.2节中已描述。
The simplest way to analyse the relation between two ordered variables is to calculate the rank correlation between them. This method will be described in section 11.7.2. However, the appropriate analysis when we wish to compare two or more paired ordinal variables on one sample (or matched samples) is the paired Wilcoxon test, which was described in section 9.7.2.

10.10 \(k \times k\) TABLES - ANALYSIS OF MATCHED VARIABLES

有时我们会获得同一受试者的匹配分类对。例如,我们可能希望比较治疗前后的疼痛程度。最简单的情况是受试者被分为两个组;我们使用正态方法或 McNemar 检验来比较配对比例,详见第10.4节和第10.7.5节。当有三个组时,McNemar 检验有一个扩展版本,称为 Stuart-Maxwell 检验(详见 Fleiss (1981, p. 119))。
Sometimes we obtain matched pairs of categorizations of the same subjects. For example, we may wish to compare degrees of pain before and after treatment. The simplest case is when subjects are classified into just two groups; we use the Normal method or the McNemar test to compare the paired proportions, as described in sections 10.4 and 10.7.5. When there are three groups there is an extension of the McNemar test known as the Stuart- Maxwell test (see Fleiss (1981, p. 119) for a description).

对于三个或更多有序类别的配对变量,适用 Wilcoxon 配对符号秩检验(见第9.7.2节)。
With three or more ordered categories for paired variables the Wilcoxon matched pairs signed ranks test is appropriate (see section 9.7.2).

一个相关的问题是评估两种分类方法的一致性;例如,我们可能希望比较两位组织学家对一系列活检样本疾病分期的分类。观察者间比较详见第14章。
A related problem occurs when we wish to assess how well two classifications agree; for example, we may wish to compare the way that two histologists classify stage of disease in a series of biopsy samples. The comparison of observers is described in Chapter 14.

10.11 风险比较 10.11 COMPARING RISKS

还有另一种分析2×2表格的方法,涉及比较两组在某事件风险上的差异。这些方法最初在流行病学中发展,特别用于病例对照研究的分析,但其应用日益广泛。这里仅考虑两组受试者和两种结果类型的情况,虽有扩展存在。
There is yet another way of analysing two by two tables, which involves the comparison of two groups with respect to the risk of some event. The methods were developed in epidemiology, especially for the analysis of case- control studies, but their use is becoming more widespread. I shall consider only the case where there are two groups of subjects and only two types of outcome, although extensions exist.

10.11.1 前瞻性研究—估计相对风险 10.11.1 Prospective study - estimating relative risk

In a prospective study groups of subjects with different characteristics are followed up to see whether an outcome of interest occurs. Many clinical trials are like this, but so too are observational studies where it is not possible to randomize the feature of interest, such as blood group. We can easily calculate the proportions having the outcome in each group, and so the ratio of these two proportions is a measure of the raised risk in one group compared to the other. We term this ratio the relative risk. Table 10.22 shows the general layout of the \(2 \times 2\) table that arises in this situation. The risks in the two groups are \(a/(a+c)\) and \(b/(b+d)\), and the relative risk is thus

\[ RR = \frac{a/(a+c)}{b/(b+d)} \]

Under the null hypothesis the expected value of RR is 1.

Table 10.22 General representation of the results of a prospective study as a \(2 \times 2\) table

| Outcome present | Group 1 | Group 2 | Total |
|---|---|---|---|
| Yes | a | b | a + b |
| No | c | d | c + d |
| Total | a + c | b + d | n |

Table 10.23 shows the results of a study of 107 'small-for-dates' babies, that is, babies whose birth weight was below the fifth centile for their length of gestation using published standards. The babies were classified as having either 'symmetric' or 'asymmetric' growth retardation on the basis of an ultrasound examination, and this classification is shown in relation to their Apgar score, which is an assessment of their well-being (see section 2.4.4).

Table 10.23 Relation between Apgar score and symmetric or asymmetric fetal growth retardation (Kurjak et al., 1978)

| Apgar < 7 | Symmetric | Asymmetric | Total |
|---|---|---|---|
| Yes | 2 | 33 | 35 |
| No | 14 | 58 | 72 |
| Total | 16 | 91 | 107 |

The proportions with an Apgar score less than 7 were 2/16 (0.13) in the symmetric group and 33/91 (0.36) in the asymmetric group. The relative risk of a low Apgar score is thus

\[ RR = \frac{2/16}{33/91} = 0.345 \]

that is, the risk in the symmetric group is about 35% of that in the asymmetric group.

We can construct a confidence interval for the relative risk using the following formula for the standard error of its logarithm:

\[ SE(\log RR) = \sqrt{\frac{1}{a} - \frac{1}{a+c} + \frac{1}{b} - \frac{1}{b+d}} \]

The sampling distribution of \(\log RR\) is approximately Normal, so we can construct, say, a 90% confidence interval for the log of the relative risk as

\[ \log RR - N_{0.95} \times SE(\log RR) \quad \text{to} \quad \log RR + N_{0.95} \times SE(\log RR) \]

where \(N_{0.95}\) is the appropriate value from the Normal distribution. The confidence interval for the relative risk is obtained by antilogging these values.

In the example, the relative risk was 0.345 and its logarithm (to base e) is -1.0651. The standard error of this value is

\[ \sqrt{\frac{1}{2} - \frac{1}{16} + \frac{1}{33} - \frac{1}{91}} = 0.6759 \]

Thus we can obtain the 90% confidence interval for the log of the relative risk in the population of all such babies as

\[ -1.0651 - 1.645 \times 0.6759 \quad \text{to} \quad -1.0651 + 1.645 \times 0.6759 \]

or -2.177 to 0.047, giving a 90% confidence interval for the relative risk of 0.11 to 1.05. The confidence interval is wide because the sample size in the symmetric group is small.
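The whole calculation for the Apgar data can be written as a short function. This is a sketch, assuming the layout of Table 10.22 (groups in columns, outcome in rows) and a standard error for the log relative risk of the form \(\sqrt{1/a - 1/(a+c) + 1/b - 1/(b+d)}\); the function name and the `level` parameter are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(a, b, c, d, level=0.90):
    """Relative risk of the outcome in group 1 versus group 2, with a
    confidence interval obtained on the log scale and then antilogged."""
    rr = (a / (a + c)) / (b / (b + d))
    se_log_rr = sqrt(1 / a - 1 / (a + c) + 1 / b - 1 / (b + d))
    # Normal quantile: 1.645 for a 90% interval, 1.960 for a 95% interval.
    quantile = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(rr) - quantile * se_log_rr)
    upper = exp(log(rr) + quantile * se_log_rr)
    return rr, se_log_rr, lower, upper


# Table 10.23: a = 2, b = 33 (Apgar < 7); c = 14, d = 58 (Apgar 7 or more).
rr, se_log_rr, lower, upper = relative_risk_ci(2, 33, 14, 58)
print(round(rr, 3), round(se_log_rr, 4), round(lower, 2), round(upper, 2))
# 0.345 0.6759 0.11 1.05
```

Calling the function with `level=0.95` gives the wider interval that would usually be reported.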

这种比较两个比例的方法基于它们的比值,而第10.3节描述的方法则基于它们的差值。一般来说,相对风险在流行病学研究中更常用,尽管它也可以(或许应该)更多地用于临床数据分析。比较组的第三种方法是通过下一节描述的比值比(odds ratio)。
This approach to the comparison of two proportions is based on their ratio, whereas the method described in section 10.3 is related to their difference. In general the relative risk is more frequently used in epidemiological work, although it could (and perhaps should) be used more for the analysis of clinical data. A third way of comparing groups is via the odds ratio described in the next section.

10.11.2 回顾性研究—比值比 10.11.2 Retrospective study - the odds ratio

(a) 两个样本 (a) Two samples

In retrospective case-control studies, we can still arrange the data in a table like Table 10.22, but there is an important difference. The selection of subjects is based on the outcome (the rows) whereas in a prospective study it is based on the characteristic defining the groups (the columns). We cannot evaluate the risk of the outcome in those with and without the characteristic because of the way the subjects were sampled. It is clear that we can get any value we like for the risk by varying the number of cases and controls that we choose to study, and so the relative risk is not a valid estimate. We need a method based on calculations within each group. We can use the ratio $a/c$, which is the odds of the outcome in the first group. Thus, for example, if the proportion with the feature in group 1 is 2/20, the odds of the feature in that group are 2 to 18 or 1/9. So the ratio of the odds in the two groups, called the odds ratio, is another way of comparing groups.

If the outcome of interest that defines the cases is rare, then $a$ will be small and $a/(a+c)$ will be approximately equal to $a/c$. Similarly $b$ will be small and $b/(b+d)$ will be approximately equal to $b/d$. Thus the relative risk will be approximately equal to $(a/c)/(b/d)$ or $ad/bc$. For case-control studies the outcome of interest is usually rare, so the odds ratio offers a method of getting an approximate relative risk despite the method of sample selection.
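A small numerical illustration may help show how the approximation behaves. The counts below are hypothetical, invented purely for this sketch; in both tables the relative risk is exactly 2, but the odds ratio is close to it only when the outcome is rare.

```python
def risk_and_odds_ratios(a, b, c, d):
    """Return (relative risk, odds ratio) for a 2 x 2 table.

    a, b = numbers with the outcome in groups 1 and 2;
    c, d = numbers without the outcome in groups 1 and 2.
    """
    relative_risk = (a / (a + c)) / (b / (b + d))
    odds_ratio = (a * d) / (b * c)
    return relative_risk, odds_ratio


# Hypothetical rare outcome: risks of 2 in 1000 versus 1 in 1000.
rare_rr, rare_or = risk_and_odds_ratios(2, 1, 998, 999)

# Hypothetical common outcome: risks of 400 in 1000 versus 200 in 1000.
common_rr, common_or = risk_and_odds_ratios(400, 200, 600, 800)

print(f"rare outcome:   RR = {rare_rr:.3f}, OR = {rare_or:.3f}")
print(f"common outcome: RR = {common_rr:.3f}, OR = {common_or:.3f}")
```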

The odds ratio is defined as $OR = ad/bc$. A confidence interval can be obtained in a similar manner as for the relative risk. We use the standard error of the logarithm of the odds ratio, given by

$$SE(\log OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}},$$

so that a 95% confidence interval for the log odds ratio is obtained as

$$\log OR - [N_{0.975} \times SE(\log OR)] \quad \text{to} \quad \log OR + [N_{0.975} \times SE(\log OR)],$$

where $N_{0.975}$ is the appropriate value from the Normal distribution. This method is suitable when none of the four cells in the table is very small; otherwise more advanced methods are needed (see Breslow and Day, 1980, p. 124).

Table 10.10 showed the results of a case-control study of erosion of dental enamel in relation to amount of swimming in a chlorinated pool. The odds ratio for that table is 2.03, so $\log OR = 0.706$. Also, we have $SE(\log OR) = 0.326$. The 95% confidence interval for the population log odds ratio is thus

$$0.706 - (1.96 \times 0.326) \quad \text{to} \quad 0.706 + (1.96 \times 0.326),$$

or from 0.067 to 1.345. Thus the 95% confidence interval for the odds ratio is from $e^{0.067}$ to $e^{1.345}$, or from 1.069 to 3.840. As the whole of the interval is greater than 1, which indicates equal risk (or odds) in the two groups, we can infer that there is a raised risk of erosion in dental enamel among those swimming more than 6 hours per week.
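The same calculation can be sketched in code. Table 10.10 is not reproduced in this section, so the four cell counts used below are an assumption: they were chosen because they reproduce the interval quoted in the text, and they should be checked against the table itself. The function name `odds_ratio_ci` is likewise illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio ad/bc with a Normal-approximation interval on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper


# ASSUMED counts (not taken from this section): a = 32, b = 17, c = 118, d = 127.
or_est, se_log_or, or_lower, or_upper = odds_ratio_ci(32, 17, 118, 127)
print(f"OR = {or_est:.2f}, SE(log OR) = {se_log_or:.3f}, "
      f"95% CI {or_lower:.3f} to {or_upper:.3f}")
```

Because the standard error involves the reciprocal of every cell, a single very small count dominates it, which is one way to see why the method is unsuitable when any cell is very small.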

(b) 配对样本 (b) Paired samples

With paired samples we need a different approach, which requires us to look at the differences between pairs. Table 10.24 shows the general structure of a matched pair case-control study; a specific example was given in Table 10.17, although the method of analysis was different. As with the earlier methods for analysing paired proportions, it is the numbers of pairs where the exposures differ that are of interest, that is, $b$ and $c$. The odds ratio is derived simply from these frequencies as $c/b$, this being the odds of being a case among the $b + c$ pairs in which only one individual was exposed. A method for calculating a confidence interval is given by Morris and Gardner (1989) and Fleiss (1981, p. 112), as are methods for the situation where there are several controls for each case.

Table 10.25 shows data from a case-control study of stress and relapse of breast cancer. Fifty women developing a first recurrence after treatment for operable breast cancer were identified. Using a large database, they were individually matched for several prognostic factors (including type of operation and use of adjuvant chemotherapy) and socio-demographic factors with 50 women who had not had a recurrence. For each woman having a recurrence (the cases) data were collected on stressful life events (such as divorce or the death of their spouse) in the period between their operation and recurrence, while for their matched control the same information was sought for the same period from the time of their own operation. The odds ratio is simply 17/3 = 5.67, and the 95% confidence interval is from 1.64 to 30.2. Although the odds ratio is large, suggesting that there is an association between stressful events and risk of recurrence of breast cancer, the small sample means that there is a wide confidence interval and thus it is necessary to be cautious about the interpretation of the finding.

Table 10.24 General structure of the results from a matched pair case-control study. Presence or absence of exposure to the possible risk are denoted by + or −

| Controls | Cases + | Cases − | Total |
|---|---|---|---|
| + | a | b | a + b |
| − | c | d | c + d |
| Total | a + c | b + d | n |

Table 10.25 Results of a matched case-control study of a relapse from breast cancer among women who did (+) or did not (−) experience a stressful life event (Ramirez et al., 1989)

| Controls | Cases + | Cases − | Total |
|---|---|---|---|
| + | 9 | 3 | 12 |
| − | 17 | 21 | 38 |
| Total | 26 | 24 | 50 |
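The matched-pairs odds ratio and an interval for it can be sketched as follows. The text does not spell out the method given in the cited references, so the approach here is an assumption: an exact (Clopper-Pearson) binomial interval is found for the proportion of discordant pairs in which the case was the exposed member, and its limits are converted to odds. With the Table 10.25 counts it gives limits close to those quoted. All function names are illustrative.

```python
from math import comb


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def solve_increasing(f, target, lo=0.0, hi=1.0, iterations=100):
    """Bisection for f(p) = target, where f is increasing in p on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def matched_pairs_or_ci(case_only, control_only, level=0.95):
    """Odds ratio from discordant pairs, with an exact binomial interval.

    case_only    = pairs where only the case was exposed (c in Table 10.24)
    control_only = pairs where only the control was exposed (b in Table 10.24)
    Assumes both counts are greater than zero.
    """
    n = case_only + control_only
    alpha = 1 - level
    # Clopper-Pearson limits for the proportion case_only / n.
    p_lower = solve_increasing(lambda p: binom_tail_ge(case_only, n, p), alpha / 2)
    p_upper = solve_increasing(lambda p: binom_tail_ge(case_only + 1, n, p), 1 - alpha / 2)
    odds_ratio = case_only / control_only
    return odds_ratio, p_lower / (1 - p_lower), p_upper / (1 - p_upper)


# Table 10.25: 17 pairs with only the case exposed, 3 with only the control exposed.
paired_or, paired_lower, paired_upper = matched_pairs_or_ci(17, 3)
print(f"OR = {paired_or:.2f}, 95% CI {paired_lower:.2f} to {paired_upper:.1f}")
```

Note that the 9 + 21 concordant pairs play no part in the calculation, which is why the effective sample size here is only 20 pairs.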

10.11.3 Combining several 2 × 2 tables

Two common situations arise where we might wish to combine results from several frequency tables, especially 2 × 2 tables. The first is where we wish to examine the relation between two variables in a single study, but making allowance for variation in a third variable. For example, in a case-control study we might have a different age distribution in the two groups and might suspect that age affected the relation between the exposure and the outcome. We could produce a 2 × 2 table for each of several age groups, and then combine the tables to get an overall odds ratio that was effectively age-adjusted. The second is where we wish to combine data from several independent studies. This is an increasingly common type of analysis for combining the results from many clinical trials to perform an objective overview or meta-analysis of all the available data (see section 15.5.2). For both of these circumstances the analysis is by the Mantel-Haenszel method, which is described in Fleiss (1981, p. 173).
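The text refers the reader elsewhere for the details, but the point estimate is simple enough to sketch. The Mantel-Haenszel pooled odds ratio is the sum over tables of $a_i d_i / n_i$ divided by the sum of $b_i c_i / n_i$, where $n_i$ is the total of table $i$. The two strata below are hypothetical, made up only to show the calculation; the associated test and confidence interval are not attempted here.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio.

    tables: iterable of (a, b, c, d) cell counts, one 2 x 2 table per stratum.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical age strata (invented for illustration only).
strata = [
    (10, 20, 5, 40),   # younger subjects: stratum odds ratio 4.0
    (15, 10, 10, 20),  # older subjects:   stratum odds ratio 3.0
]
pooled_or = mantel_haenszel_or(strata)
print(f"Mantel-Haenszel pooled OR = {pooled_or:.3f}")
```

The pooled value lies between the two stratum-specific odds ratios, with each stratum weighted according to its size and balance.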

当感兴趣的结局较为罕见时,优势比大致等同于相对风险。然而,对于常见事件,两者可能有较大差异,因此最好将优势比视为一种独立的测量指标。当我们将某一特征的有无与多个因素相关联时,也会用到优势比。此类分析在第12.5节中有详细描述。
The odds ratio is approximately the same as the relative risk if the outcome of interest is rare. For common events, however, they can be quite different, so it is best to think of the odds ratio as a measure in its own right. Odds ratios also feature when we relate the presence or absence of a feature to more than one factor. This analysis is described in section 12.5.

10.12 结果的呈现 10.12 PRESENTATION OF RESULTS

频数数据在呈现结果时相对较少遇到问题。
我支持同时提供所有观察到的频数以及四舍五入后的百分比(或比例)汇总。
这是因为通常无法仅通过报告的百分比重建频数。
提供百分比有助于快速直观地评估不同规模组间的差异。
表10.18和表10.21就是这种呈现方式的示例。
当行数超过两行时,应为所有行提供百分比。
Frequency data pose relatively few problems when presenting results. I am in favour of giving all observed frequencies together with summaries as rounded percentages (or proportions). This is because it is often not possible to reconstruct the frequencies from reported percentages. It is useful to give percentages to allow a quick visual appraisal of variation among groups of varying size. Tables 10.18 and 10.21 are examples of this style of presentation. Percentages should be given for all rows when there are more than two rows.

Comparisons of proportions should be accompanied by the test statistic $z$, the $P$ value, and a confidence interval for the difference. The test statistic $X^2$, the degrees of freedom and the $P$ value should all be quoted when reporting Chi squared tests. The degrees of freedom can be omitted if it is clear that only 2 × 2 tables are involved. In this book I have used $X^2$ for the test statistic to distinguish it from $\chi^2$ which is the theoretical distribution. It is common, however, to use $\chi^2$ for the test statistic too, and either is acceptable.

10.13 总结 10.13 SUMMARY

The approach to frequency data via the comparison of proportions is preferable to the Chi squared approach because it provides estimates of quantities of interest and related confidence intervals. In contrast, Chi squared tests yield only $P$ values. For larger tables, however, there is no simple alternative to using Chi squared tests or sometimes rank methods. Frequency data are thus less amenable than continuous data to methods of estimation, and analysis relies more on hypothesis testing. For this reason alone, it is unwise to categorize a continuous variable to 'simplify' the analysis. While this is often useful when exploring a data set, it is not generally advisable when analysing data as it is throwing away information; an example is given in Table 10.5. There are also statistical methods for modelling frequency data, called log-linear models, but they are beyond the scope of this book.

无论采用何种分析方法,重要的是考虑类别的任何顺序。Moses 等(1984)回顾了相关选项,本章中也描述了这些方法。
Whatever method of analysis is used, it is important to take any ordering of categories into account. Moses et al. (1984) have reviewed the options, and the methods have been described in this chapter.

频数表还出现在其他类型分析中,目的有所不同。两个例子是比较观察者的评估(如疾病分期)和在诊断背景下用一个变量预测另一个变量。这两种情况将在第12章讨论。
Frequency tables also arise in other types of analysis, where the aims are rather different. Two such cases are the comparison of observers' assessments, such as of stage of disease, and the use of one variable to predict another in the context of diagnosis. Both of these situations are considered in Chapter 12.

练习 EXERCISES

10.1 A study was carried out to see if patients whose skin did not respond to dinitrochlorobenzene (DNCB), a contact allergen, would show an equally negative response to croton oil, a skin irritant (Roth et al., 1975). The following table shows the results of simultaneous skin reaction tests to DNCB and croton oil in 173 patients with skin cancer.

| Croton oil | DNCB +ve | DNCB −ve | Total |
|---|---|---|---|
| +ve | 81 | 48 | 129 |
| −ve | 23 | 21 | 44 |
| Total | 104 | 69 | 173 |

(a) 作者报告“两项测试无相关性”。请进行适合该临床问题的分析。
(a) The authors reported 'no correlation' between the two tests. Carry out an analysis appropriate to the clinical question posed.

(b) DNCB测试的结果在不同癌症分期患者中进行了比较,如下表所示。
(b) The results of the DNCB test were compared for patients with different stages of cancer, as shown in the following table.

| DNCB reaction | Stage I | Stage II | Stage III | Total |
|---|---|---|---|---|
| +ve | 39 | 39 | 26 | 104 |
| −ve | 13 | 19 | 37 | 69 |
| Total | 52 | 58 | 63 | 173 |

DNCB反应性与这些患者的癌症分期有关吗?
Is DNCB reactivity related to stage of cancer in these patients?

10.2 一项调查旨在检验假设:声音低沉的男性可能比声音较高的男性具有更高的睾酮水平(Lyster, 1984)。对专业和学生男歌手询问了他们有多少兄弟姐妹。以下表格显示了195名有兄弟姐妹的歌手的结果:
10.2 A survey was carried out to test the hypothesis that men with deep singing voices are likely to have higher levels of testosterone than men with higher voices (Lyster, 1984). Professional and student male singers were asked how many brothers and sisters they had. The results for the 195 singers with siblings are shown in the following table:

| Voice group | Brothers | Sisters | Total |
|---|---|---|---|
| Bass | 77 (62%) | 47 | 124 |
| Bass-baritone | 38 (56%) | 30 | 68 |
| Baritone | 75 (56%) | 58 | 133 |
| Tenor-baritone | 5 (56%) | 4 | 9 |
| Tenor | 27 (42%) | 37 | 64 |
| Counter-tenor | 9 (38%) | 15 | 24 |
| Total | 231 (55%) | 191 | 422 |

报告作者写道:
The author of the report wrote:

“对六个声部组中兄弟和姐妹的频数进行 检验未见显著性。另一方面,如果仅考虑男低音、男高音和假声男高音……则得到 ;自由度 。因此,在极端声部之间可以证明两变量之间存在关系,但在中间声部范围内关系不存在。”
'A test on the frequencies of brothers and sisters in the six voice- groups yields no significance. If, on the other hand, one chooses to look simply at bass, tenor and counter- tenor … one obtains ; df ; . A relationship between the two variables can thus be demonstrated at the extremes, but breaks down in the middle range.'

(a) 忽略声部组顺序的 检验结果为 ,自由度为5, 。请对作者的解释进行评论。
(a) The test ignoring the ordering of the voice groups gives on 5 df, . Comment on the author's interpretation.

(b) 为什么该检验不适合本研究目的?请进行适当的分析并解释结果。
(b) Why is this test not appropriate to the aim of the study? Perform an appropriate analysis and interpret the answers.

(c) 作者进行了仅涉及三个组的第二次分析。为什么该分析无效?
(c) The author performed a second analysis involving just three groups. Why is this analysis invalid?

(d) Why is his quoted value of $\chi^2$ of 11.27 clearly wrong?

(e) 数据的哪个特征破坏了所有卡方检验的一个基本假设?
(e) What feature of the data breaks a fundamental assumption of all Chi squared tests?

(f) 这个问题有多重要,我们能做些什么?(这相当困难!)
(f) How important is this problem, and what can we do about it? (This is rather difficult!)

10.3 A randomized controlled clinical trial was carried out to compare the effects of a single dose of prednisolone and placebo in children with acute asthma (Storr et al., 1987). There were 73 children in the placebo group and 67 in the prednisolone group. The results section of the paper begins with the following statement: '2 patients in the placebo group (3%, 95% confidence interval −1 to 6%) and 20 in the prednisolone group (30%, 19 to 41%) were discharged at first examination (P …).' The methods section explains that this P value was derived using Fisher's exact test.

(a) 使用Fisher精确检验而非卡方检验合理吗?
(a) Was it reasonable to use Fisher's exact test rather than the Chi squared test?

(b) 置信区间有什么问题,更好的分析方法是什么?
(b) What is wrong with the confidence intervals, and what would be a better analysis?

10.4 The following table shows the number of hours spent in bed by 14 year old boys and girls (Macgregor and Balding, 1988). Times were rounded up to the next half hour.

Time spent in bed (hours)

| | ≤ 7.0 | 7.5 | 8.0 | 8.5 | 9.0 | 9.5 | 10.0 | > 10.0 | Total |
|---|---|---|---|---|---|---|---|---|---|
| Boys | 88 | 109 | 210 | 324 | 359 | 313 | 182 | 85 | 1670 |
| Girls | 92 | 108 | 217 | 349 | 436 | 334 | 198 | 65 | 1799 |
| Total | 180 | 217 | 427 | 673 | 795 | 647 | 380 | 150 | 3469 |

(a) Which methods of analysis could be used to compare the distributions for boys and girls?

(b) Is there any difference between boys and girls?

10.5 对65名接受或正在接受硫代硫酸钠金治疗类风湿关节炎的患者进行了研究(Ayesh 等,1987)。研究目的是探讨硫代硫酸钠金(SA)毒性是否可能与硫氧化能力相关,硫氧化能力通过硫氧化指数(SI)评估。数据见练习3.1。 被视为硫氧化受损。他们得到了如下表格:
10.5 A study was made of 65 patients who had received or were receiving sodium aurothiomalate as a treatment for rheumatoid arthritis (Ayesh et al., 1987). The aim was to examine the possibility that toxicity to sodium aurothiomalate (SA) might be linked to sulphoxidation capacity, as assessed by the sulphoxidation index (SI). The data were given in Exercise 3.1. Values of were taken as indicating impaired sulphoxidation. They obtained the following table:

Major adverse reaction (toxicity)

| Impaired sulphoxidation | Yes | No | Total |
|---|---|---|---|
| Yes | 30 | 9 | 39 |
| No | 7 | 19 | 26 |
| Total | 37 | 28 | 65 |

The authors wrote: 'The incidence of impaired sulphoxidation in patients showing SA toxicity (30/37; 81.0%) was significantly greater than in the group without adverse reaction (9/28; 32.1%) (…). Similarly, the incidence of toxicity was significantly increased in those with impaired sulphoxidation compared to those with extensive sulphoxidation (30/39; 76.9% vs 7/26; 26.9%) (…).'

(a) 为什么上述两个卡方检验不能同时正确?
(a) Why can't both of the above Chi squared tests be correct?

(b) 对表中数据进行卡方检验,并将你的结果与上述段落中的两个结果进行比较。
(b) Carry out a Chi squared test of the data in the table and compare your answer with the two results in the above paragraph.

10.6 Among patients with oral cancer registered in Kerala, India, between 1982 and 1986, the relation between the site of the cancer and betel chewing, smoking or alcohol consumption was examined (Sankaranarayanan et al., 1989). The data for patients aged over 30 are summarized in the following table:

习惯口腔内部位
舌头 (n = 175)颊黏膜 (n = 300)其他 (n = 156)
咀嚼146267121
吸烟71166102
饮酒517146
无上述习惯17126
HabitIntra oral subsite
Tongue (n = 175)Buccal mucosa (n = 300)Other (n = 156)
Chewing146267121
Smoking71166102
Alcohol517146
None of these17126

可以进行什么样的检验来关联习惯与癌症部位?
What sort of test could be performed to relate habit to site of cancer?

10.7 Sixty-five pregnant women at a high risk of pregnancy-induced hypertension participated in a randomized controlled trial comparing 100 mg of aspirin daily and a matching placebo during the third trimester of pregnancy (Schiff et al., 1989). The observed rates of hypertension are shown in the following table:

| | Aspirin treated | Placebo treated | Total |
|---|---|---|---|
| Hypertension | 4 | 11 | 15 |
| No hypertension | 30 | 20 | 50 |
| Total | 34 | 31 | 65 |

这些数据是否表明每日服用阿司匹林能降低妊娠后期高血压的风险?
Do these data suggest that daily aspirin reduces the risk of hypertension in the last trimester of pregnancy?

10.8 A case-control study was carried out to investigate the aetiology of acoustic neuromas (Preston-Martin et al., 1989). Men aged 25-69 at the time of diagnosis who were resident in Los Angeles County were eligible for inclusion. A total of 118 men were identified who were alive and able to be interviewed. Twenty-eight patients were not interviewed because the physician refused permission (12), the patient chose not to participate (9), or the patient could not be located (7). For 86 of the remaining patients the researchers identified and interviewed a neighbourhood control of the same race and within five years of age.

每对病例-对照均由同一访谈员以相同方式进行访谈,收集关于各种生活经历的信息。特别关注工作中暴露于噪声的情况。总体上,58名病例和46名对照曾有工作中暴露于噪声的经历。20对病例-对照中,病例有噪声暴露而对照无,8对中对照有噪声暴露而病例无。
Both members of each case- control pair were interviewed in the same manner by the same interviewer to obtain information about various life experiences. Exposure to loud noise at work was of particular interest. Overall 58 cases and 46 controls had had some exposure to loud noise at work. There were 20 case- control pairs for which the case but not the control had had such exposure, and 8 pairs where the control but not the case had had some exposure.

(a)进行适当分析,比较病例组和对照组中暴露比例的差异。
(a) Carry out an appropriate analysis to compare the proportions of exposed cases and controls.

(b) 计算工作中暴露于噪声与听神经瘤之间的比值比。
(b) Calculate the odds ratio for acoustic neuroma associated with exposure to loud noise at work.

11 两个连续变量之间的关系 11 Relation between two continuous variables

11.1 关联、预测与一致性 11.1 ASSOCIATION, PREDICTION AND AGREEMENT

大量统计分析都是为了研究一组受试者中两个变量之间的关系。此类分析的三个主要目的可能是:
A high proportion of statistical analyses are carried out to study the relation between two variables within a group of subjects. Three main purposes of such analyses might be:

1. to assess whether the two variables are associated, that is, if the values of one variable tend to be higher (or, alternatively, lower) for higher values of the other variable;
2. to enable the value of one variable to be predicted from any known value of the other variable;
3. to assess the amount of agreement between the values of the two variables; most commonly this situation arises in the comparison of alternative ways of measuring or assessing the same thing.

本章将讨论前两种可能性。一致性问题将在第14.2节中讨论。
In this chapter I shall consider the first two possibilities. The question of agreement is dealt with in section 14.2.

第10章介绍了研究分类变量间关联的方法。本章将讨论用于评估连续变量间关联的类似方法,即相关分析。相比之下,本章首次提及从一个变量预测另一个变量的方法。本章讨论从一个连续变量预测另一个连续变量,采用的技术是线性回归。当一个变量是分类变量时,需用稍有不同的逻辑回归技术,该内容将在第12章介绍。
Methods for studying association between categorical variables were introduced in Chapter 10. In this chapter I shall consider comparable methods for assessing the association between continuous variables, using the method known as correlation. In contrast, this is the first mention of methods for predicting one variable from another. This chapter considers the prediction of one continuous variable from another, for which the technique of linear regression is used. The slightly different technique of logistic regression, which is needed when one variable is categorical, will be considered in Chapter 12.

This chapter is devoted to two techniques, correlation and regression, which are so often presented together that it is easy to get the impression that they are inseparable. In fact, they have distinct purposes and it is relatively rare that one is genuinely interested in performing both analyses on the same set of data. The confusion that clearly exists between correlation and regression may well stem from poor differentiation between the techniques in many textbooks, which in turn arises from the very close mathematical relation between the two methods. Clearly the rationale for carrying out a particular analysis is of paramount importance, and this aspect will be particularly stressed in this chapter.

11.2 相关性 11.2 CORRELATION

Correlation is the method of analysis to use when studying the possible association between two continuous variables. Figure 11.1 shows the relation between body fat percentage (%fat) and age among 18 normal adults aged 23 to 61. The data come from a small study investigating a new method of assessing body composition. There appears to be some association between the values of the two variables; we can see that there is a tendency for the older people to have a higher percentage of body fat.

Figure 11.1 Body fat percentage (%fat) related to age for 18 normal adults (Mazess et al., 1984).

If we want to measure the degree of association, this can be done by calculating the correlation coefficient, often loosely just called the correlation. The standard method (often ascribed to Pearson) leads to a quantity called $r$ which can take any value from $-1$ to $+1$. This correlation coefficient measures the degree of 'straight-line' association between the values of the two variables. Thus a value of $+1$ or $-1$ is obtained if all the points in a scatter diagram lie on a perfect straight line, as shown in Figure 11.2. Also shown are examples of data with intermediate values of $r$. The correlation between two variables is positive if higher values of one variable are associated with higher values of the other and negative if one variable tends to be lower as the other gets higher. A correlation of around zero indicates that there is no linear relation between the values of the two variables (i.e. they are uncorrelated). Clearly the variables in Figure 11.1 are positively correlated; in fact the correlation coefficient can be calculated to be 0.79.
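For readers who want to see the quantity computed, here is a minimal sketch of Pearson's correlation coefficient in its usual product-moment form (the book's own presentation of the calculation is in section 11.7). The data are hypothetical, not the body fat data of Figure 11.1, and the function name `pearson_r` is an illustrative choice.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical values, for illustration only.
ages = [20, 30, 40, 50, 60]
rising_line = [2 * a + 1 for a in ages]   # exact straight line, upward slope
falling_line = [100 - a for a in ages]    # exact straight line, downward slope
scattered = [25, 22, 34, 28, 36]          # upward tendency with scatter

r_rising = pearson_r(ages, rising_line)
r_falling = pearson_r(ages, falling_line)
r_scattered = pearson_r(ages, scattered)
print(f"{r_rising:.2f} {r_falling:.2f} {r_scattered:.2f}")
```

The two exact lines give correlations of +1 and −1 whatever their slopes, which underlines that $r$ measures scatter about a line and not its steepness.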

What are we measuring with $r$? In essence $r$ is a measure of the scatter of the points around an underlying linear trend: the greater the spread of the points the lower the correlation. In the study already referred to, dual-photon absorptiometry was used to derive a measure of body fat as a percentage of total body mass. Figure 11.3 shows %fat plotted against weight for the same 18 subjects. It is clear that there is considerable scatter with no obvious underlying relationship between %fat and weight. The correlation between these two variables is 0.03, confirming the visual impression.

一个非常强相关的例子是不同哺乳动物物种的母体体重与胎儿体重的数据。图11.4显示了这些数据经过对数转换后的散点图。两个变量的相关系数为0.985,且关系在从蝙蝠到鲸鱼的极端物种之间表现出极其一致的规律。
An example of very strong correlation is given by data relating maternal and fetal weight of different species of mammal. Figure 11.4 shows a plot of these data after log transformation. The correlation between the two variables is 0.985, and the relation is clearly remarkably consistent from bats at one extreme through to whales at the other.

11.2.1 数据分布 11.2.1 Data distribution

The correlation coefficient can be calculated for any data set. However, there is a restriction on the validity of the associated hypothesis test, which is that the two variables are observed on a random sample of individuals and that the data for at least one of the variables have a Normal distribution in the population. For the calculation of a valid confidence interval for $r$ both variables should have a Normal distribution.

In practice, therefore, it is preferable for both variables to have an approximately Normal distribution for any use of Pearson's $r$. Data of this type will display a roughly elliptical pattern, with the degree of elongation of the ellipse being related to the correlation coefficient. For small samples, or where $r$ is near $+1$ or $-1$, this feature may be hard to detect, however. The easiest way to check the validity of the hypothesis test is by examining a scatter diagram of the data, which ought to be produced as a matter of routine whenever correlation coefficients are calculated. It should be easy to tell whether the data show a reasonably elliptical pattern. Normal plots could be produced, and Normality can be tested formally by the Shapiro-Wilk W test (see Chapter 7), but it is not really necessary because the scatter plot will usually suffice.

如果数据不服从正态分布,可以对一个或两个变量进行变换,如图11.4所示的数据,或者计算非参数相关系数,详见第11.4节。
If the data do not have a Normal distribution either or both of the variables can be transformed, as for the data shown in Figure 11.4, or a non- parametric correlation coefficient can be calculated, as described in section 11.4.

The mathematical calculations for $r$, its confidence interval, and the associated hypothesis tests are shown in section 11.7.

Figure 11.2 Data with correlation coefficients (r) of (a) 1.0; (b) −1.0; (c) 0.0; (d) 0.3; (e) −0.5; (f) 0.7.

Figure 11.3 Relation between percentage of fat and bodyweight in 18 normal adults (Mazess et al., 1984).

Figure 11.4 Relation between total fetal weight and non-pregnant maternal weight in 121 species of mammal (Leitch et al., 1959).

11.2.2 Confidence interval for r

We can obtain a confidence interval for the correlation in the population, on the assumption that the sample is representative. For the data in Figure 11.1 the correlation coefficient is 0.79. Using the method described in section 11.7 we can obtain the 95% confidence interval for the correlation coefficient as being from 0.52 to 0.92. As is usual in small samples, the confidence interval is wide, but it does suggest that there really is quite a strong association between the two variables.
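Section 11.7 is not part of this extract, so the following is an assumption about the method: the usual approach is Fisher's $z$ transformation, in which $z = \tanh^{-1}(r)$ is treated as Normal with standard error $1/\sqrt{n-3}$ and the limits are transformed back. A minimal sketch, with an illustrative function name:

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Confidence interval for a correlation via Fisher's z transformation."""
    z = atanh(r)                      # 0.5 * log((1 + r) / (1 - r))
    se = 1 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return tanh(z - crit * se), tanh(z + crit * se)


# Correlation between %fat and age: r = 0.79 in n = 18 adults.
r_lower, r_upper = correlation_ci(0.79, 18)
print(f"95% CI for r: {r_lower:.2f} to {r_upper:.2f}")
```

Starting from the rounded value 0.79 the lower limit comes out slightly below the quoted 0.52; the difference is presumably due to rounding of $r$ before the calculation.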

11.2.3 Hypothesis test for r

There is a simple test of significance of the null hypothesis of no association which is based on the $t$ distribution. The method is described in section 11.7. However, Table B7 shows critical values which allow observed values of $r$ to be looked up directly; these should suffice for most practical purposes. For example, the correlation between %fat and age in the data shown in Figure 11.1 was 0.79, and from Table B7 we can see that $P < 0.001$.
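The details are in section 11.7, which is outside this extract, so the formula below is an assumption: the standard test refers $t = r\sqrt{(n-2)/(1-r^2)}$ to the $t$ distribution on $n-2$ degrees of freedom. The Python standard library has no $t$ distribution, so this sketch simply compares the statistic with the tabulated two-sided 0.1% critical value for 16 degrees of freedom (4.015, taken from standard $t$ tables).

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic and degrees of freedom for testing zero correlation."""
    df = n - 2
    t = r * sqrt(df / (1 - r * r))
    return t, df


# Two-sided 0.1% critical value of t on 16 df, from standard tables.
T_CRIT_0001_DF16 = 4.015

t_stat, df = correlation_t(0.79, 18)
significant_at_0001 = t_stat > T_CRIT_0001_DF16
print(f"t = {t_stat:.2f} on {df} df; exceeds 0.1% critical value: {significant_at_0001}")
```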

11.3 相关性的使用与误用 11.3 USE AND MISUSE OF CORRELATION

除第11.2.1节提到的分布假设外,另一个限制是所有观察值应相互独立。实际上,这意味着每个研究对象的每个变量只能有一个观察值。当部分或全部受试者有多个观察值时,相关分析不再有效。例如,若对孕妇在不同孕周测量血压和雌激素水平多次,使用相关分析来关联这两者是不正确的。在这种情况下,正确的分析可能非常复杂。
As well as the distributional assumptions mentioned in section 11.2.1, another restriction is that all the observations should be independent. In practice this means that only one observation of each variable should come from each individual in the study. The analysis is not valid when there is more than one observation for some or all of the subjects. For example, it would not be correct to use correlation to relate, say, blood pressure and oestrogen levels in pregnant women with varying numbers of observations at different gestational ages. In such circumstances a proper analysis can be very complex.

即使上述假设未被违反,相关分析的使用也并非看上去那么简单。事实上,相关分析的误用非常普遍,以至于一些统计学家希望该方法从未被发明。最明显的普遍误用出现在记录大量变量的研究中。显然,变量越多,可以计算的相关系数越多,随后挑选出统计显著的相关系数。虽然“数据挖掘”在探索性分析中有限度地被接受,但若过度使用,过度解读的风险极大。例如,仅有10个变量,就可以计算45个变量对之间的相关系数。此问题将在11.8节进一步讨论。
Even when the assumptions just mentioned are not violated the use of correlation is not as simple as it looks. Indeed, misuse of correlation is so common that some statisticians have wished that the method had never been devised. The most obvious general misuse occurs in studies in which large numbers of variables have been recorded. Clearly, with many variables it is possible to calculate hundreds of correlation coefficients and then pick out just those which are statistically significant. While 'data-dredging' is acceptable in a limited way in exploratory analyses, when taken to extremes the scope for over-interpretation is considerable. For example, even with only ten variables 45 correlations between pairs of variables can be calculated. This problem is discussed further in section 11.8.

相关性还有几种较为具体的误用类型,性质各异但均常见。下面讨论六种类型。每种情况下,数学计算本身无误,但解释存在缺陷。
There are several rather more specific misuses of correlation, each somewhat different in nature but all frequently seen. Six types are discussed below. In each case there is nothing wrong with the mathematical calculations, but the interpretation is flawed.

11.3.1 涉及时间的虚假相关 11.3.1 Spurious correlations involving time

两个变量若均为随时间重复测量,其相关性可能极具误导性。通过这种方式,可以“证明”汽油价格与离婚率、黄油消费与农民收入(负相关)等关系。另一个例子见第5.13节。
The correlation of two variables both of which have been recorded repeatedly over time can be grossly misleading. By such means one may demonstrate relationships between the price of petrol and the divorce rate, consumption of butter and farmers' incomes (a negative relation), and so on. Another example was given in section 5.13.

对个体随时间变化的两个变量进行研究时,同样需谨慎。这类相关性往往是虚假的:在计算相关之前,必须去除数据中的时间趋势,这需要专家协助。时间相关数据将在第14.6节进一步讨论。
The same caution applies to studying two variables over time for an individual. Such correlations are often spurious: it is necessary to remove the time trends from such data before correlating them, and this is an area that requires expert assistance. Time-related data are considered further in section 14.6.

11.3.2 个体的有限抽样 11.3.2 Restricted sampling of individuals

如前所述,隐含的假设是所研究的受试者是来自某一特定人群(如孕妇或高血压男性)的随机样本(或近似随机样本)。因为某个变量的取值而有意地增加或减少样本中的个体,会对相关系数产生显著影响。例如,如果我们向图11.1所示的数据集中添加几个儿童,相关系数将大幅增加;而如果排除身高超过 的个体,相关系数则会降低(降至 )。这两种操作都无法使相关系数得到有效解释,因为样本不再是合适的随机样本。相关分析对样本选择尤为敏感,因为每个变量的个体间变异直接参与计算。
As already indicated, there is an implicit assumption that the subjects being studied are a random sample (or nearly so) from some specified population of individuals, such as pregnant women or hypertensive men. Deliberately adding or taking away from our sample some individuals because of their values of one of the variables can have a dramatic effect on the correlation. For example, if we added a few children to the data set shown in Figure 11.1 we would increase the correlation considerably, whereas if we excluded anyone taller than we would decrease the correlation (to ). Neither manoeuvre would allow a valid interpretation of the correlation coefficient because the sample would no longer be a proper random sample. Correlation analysis is especially sensitive to the sample selection because the between subject variation in each variable enters directly into the calculation.

11.3.3 混合样本 11.3.3 Mixed samples

当样本包含不同亚组时,计算相关系数可能会产生误导。例如,图11.1中的体脂数据涉及14名女性和4名男性。男性的体脂百分比通常较低,且这4名男性明显比女性年轻,因此混合性别会导致相关系数被高估(见图11.5)。因此,最好仅考虑女性样本,此时得到的 $r$ 值较低。混合亚组的另一个后果是,混合后的数据可能不服从正态分布,但除非各组差异很大且样本量充足,否则难以检测该效应。
It may be misleading to calculate the correlation when the sample comprises different subgroups. For example, the body fat data in Figure 11.1 relate to 14 women and 4 men. Body fat percentage tends to be lower in men, and it happens that the four men in this study were considerably younger than the women, so mixing the sexes tends to inflate the correlation (see Figure 11.5). It would therefore be better to consider the women only, for whom we get a rather lower value of $r$. Another consequence of the mixing of subgroups is that the data (when mixed) may not be Normally distributed, but the effect cannot be detected unless the groups are very different and the sample is large.

图11.5 按年龄划分的体脂百分比,男性(+)和女性(O)。
Figure 11.5 % fat by age showing males (+) and females (O).

11.3.4 评估一致性 11.3.4 Assessing agreement

医学研究中经常需要比较两种测量同一数量的方法。实验室方法中常见此类问题,临床医学中也很普遍,特别是在无法直接测量感兴趣数量时。血压就是一个明显的例子。
In medical research there is frequently the need to compare two methods of measuring the same quantity. Laboratory methods throw up many such problems, but they are also common in clinical medicine, particularly where it is not possible to measure directly the quantity of interest. Blood pressure is an obvious example.

分析此类数据最常用的方法是计算相关系数,但这是一种误解的分析。如前所述,相关系数衡量的是两个量之间的关联程度;它不衡量它们的一致性(Bland 和 Altman, 1986)。方法比较研究在第14.2节中有详细讨论。
The most common method of analysing such data is to calculate the correlation coefficient, but this is a misconceived analysis. As we have seen, the correlation coefficient measures the degree of association between two quantities; it does not measure how closely they agree (Bland and Altman, 1986). Method comparison studies are discussed in detail in section 14.2.

使用相关分析研究初始测量值与该测量值随时间的变化之间的关系时,会出现一种截然不同的问题。例如,我们可能有兴趣观察一种旨在降低血清胆固醇的饮食是否在初始血清胆固醇较高的人群中更有效。这是一个合理的问题,但遗憾的是,在这里使用相关性是具有误导性的。原因在于,对于任意两个量 $X$ 和 $Y$,$X$ 会与 $X - Y$ 相关。事实上,即使 $X$ 和 $Y$ 是随机数样本,我们也预期 $X$ 与 $X - Y$ 之间的相关系数约为0.7。(你可以用表B13中的随机数表尝试验证。)换句话说,即使饮食无效,我们也预期初始血清胆固醇与血清胆固醇变化之间会有较大的相关性。这种现象称为均值回归,进一步混淆了回归与相关的概念。
A rather different problem occurs with the use of correlation to study the relation between an initial measurement and the change in that measurement over time. For example, we may be interested in seeing whether a diet designed to lower serum cholesterol was more effective in people with higher initial values of serum cholesterol. This is a reasonable question, but unfortunately it turns out that the use of correlation here is misleading. This is because for any two quantities $X$ and $Y$, $X$ will be correlated with $X - Y$. Indeed, even if $X$ and $Y$ are samples of random numbers we would expect the correlation between $X$ and $X - Y$ to be 0.7. (You can try this with some numbers from the table of random numbers in Table B13.) In other words, we expect to obtain a large correlation between initial serum cholesterol and the change in serum cholesterol even if the diet is ineffective. The name for the phenomenon is regression to the mean, giving another confusion between regression and correlation.

解决此问题的最简单方法是取初始值和最终测量值的平均数,并计算该量与观察到的变化之间的相关性。用上述符号表示,即计算 $(X + Y)/2$ 与 $X - Y$ 的相关系数。如果该相关系数较大,则可以合理推断变量的较高初始水平与随时间的较大下降(或较小上升)相关。然而,这类数据的最佳处理方法较为复杂:Blomqvist(1986)和 Hayes(1988)对此有进一步讨论。这类问题远比表面复杂,建议寻求统计学专业意见。
The simplest way around this problem is to take the average of the initial and final measurement and calculate the correlation between this quantity and the observed change. In the above notation this means correlating $(X + Y)/2$ with $X - Y$. If this correlation is large it may reasonably be inferred that higher initial levels of the variable are associated with larger falls over time (or smaller rises). However, the best approach to this type of data is complex: further discussion is given by Blomqvist (1986) and Hayes (1988). There is more to this type of problem than is apparent, and statistical advice is recommended.
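The effect described above can be checked by simulation. The following is a minimal sketch, not taken from the book: it assumes $X$ and $Y$ are independent standard Normal random numbers with equal variance, and the helper name `pearson` and the seed are illustrative choices. Under those assumptions the correlation between $X$ and $X - Y$ should be close to $1/\sqrt{2} \approx 0.71$, while the correlation between $(X + Y)/2$ and $X - Y$ should be close to zero.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable (arbitrary choice)
n = 20000

# X = "initial" value, Y = "final" value; they are unrelated by construction.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

change = [a - b for a, b in zip(x, y)]          # X - Y
average = [(a + b) / 2 for a, b in zip(x, y)]   # (X + Y) / 2

r_initial_change = pearson(x, change)
r_average_change = pearson(average, change)

print(f"corr(X, X - Y)         = {r_initial_change:.3f}")
print(f"corr((X + Y)/2, X - Y) = {r_average_change:.3f}")
```

The second correlation is near zero only because the two variables were given the same variance here; with unequal variances the average and the difference are themselves correlated, which is one reason the text recommends statistical advice.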

11.3.6 部分与整体的关系 11.3.6 Relating a part to the whole

如果研究组成部分与总量之间的关系,会出现类似情况。例如,我们预期会发现以下相关性:
A similar situation arises if we study the relation between a constituent and the total amount. For example, we would expect to find a correlation between:

1. 5岁时的身高与成年身高;
1. height at age 5 and adult height;
2. 黄体期长度与整个月经周期长度;
2. length of the luteal phase and length of the whole menstrual cycle; and
3. 蛋白质摄入量与总热量摄入量;
3. intake of protein and intake of calories;

因为在每种情况下,第二个量包含第一个量,尽管不一定是显式包含。第一个量与其在总量中的补充部分之间可能没有关系(甚至存在负相关)。如前节讨论的问题,将分析表达为 $X$ 与 $X + Y$ 之间的相关性,表明无论 $X$ 和 $Y$ 是什么,这两个量都是相关的。
because in each case the second quantity contains the first, although not necessarily explicitly. There may be no relation (or even a negative relation) between the first quantity and its complement within the total. As with the problem discussed in the previous section, expressing the analysis as the correlation between $X$ and $X + Y$ shows that the two quantities are related whatever $X$ and $Y$ are.

11.4 秩相关 11.4 RANK CORRELATION

秩的概念在第2章中引入,第9章展示了其在两组连续数据比较中的应用。
The concept of ranks was introduced in Chapter 2 and applications to the comparison of continuous data from two groups were shown in Chapter 9.

在考虑两个变量之间的关系时,也可以类似地使用秩。这里的想法很简单,就是按每个变量对一组受试者进行排序,然后比较排序顺序。例如,表11.1显示了图11.1中年龄和脂肪百分比(% fat)的数据,以及观察值的秩。当两个值相同时,两个值均赋予平均秩。
A similar use of ranks is possible when considering the relation between two variables. The idea here is simply to rank a set of subjects for each variable and compare the orderings. For example, Table 11.1 shows the data for age and measurements of % fat from Figure 11.1, together with the ranks of the observations. Where two values are the same the average rank is assigned to both.

为了使关系更清晰,受试者按年龄排序。这样排列数据可以让我们快速判断两个变量是否可能相关,因为很容易判断第二列秩值是趋向增加还是减少。
To make the relationship clearer, the subjects have been ordered by age. Arranging data like this allows us to get a quick impression about the possibility that the two variables are associated, as it is quite easy to judge whether the values in the second column of ranks are tending to increase or decrease.

计算秩相关系数有两种常用方法,一种是斯皮尔曼(Spearman)方法,另一种是肯德尔(Kendall)方法。一般来说,计算斯皮尔曼的 $r_s$(常称为斯皮尔曼的 $\rho$)比计算肯德尔的 $\tau$ 更简单,因此这里使用斯皮尔曼系数。计算过程见第11.7节。实际上,斯皮尔曼秩相关系数与对观察值的秩计算的皮尔逊相关系数完全相同。
There are two commonly used methods of calculating the rank correlation coefficient, one due to Spearman and one to Kendall. It is easier in general to calculate Spearman's $r_s$ (often called Spearman's $\rho$ (rho)) than Kendall's $\tau$ (tau), so it is the Spearman coefficient that is used here. The calculations are shown in section 11.7. In fact, Spearman's rank correlation coefficient is exactly the same as the Pearson correlation coefficient calculated on the ranks of the observations.

表11.1中年龄与脂肪百分比数据的秩相关系数为0.75,接近标准皮尔逊相关系数0.79。
The rank correlation between the age and % fat data shown in Table 11.1 is 0.75, which is close to the value 0.79 obtained as the standard Pearson correlation.

表11.1 18名正常成人的年龄和脂肪百分比(通过双光子吸收法测量)(Mazess等,1984)
Table 11.1 Age and % fat (measured by dual-photon absorptiometry) for 18 normal adults (Mazess et al., 1984)

| 受试者 Subject | 年龄 Age | 秩 Rank | 脂肪百分比 % fat | 秩 Rank |
|---|---|---|---|---|
| 1 | 23 | 1.5 | 9.5 | 2 |
| 2 | 23 | 1.5 | 27.9 | 7 |
| 3 | 27 | 3.5 | 7.8 | 1 |
| 4 | 27 | 3.5 | 17.8 | 3 |
| 5 | 39 | 5 | 31.4 | 11 |
| 6 | 41 | 6 | 25.9 | 5 |
| 7 | 45 | 7 | 27.4 | 6 |
| 8 | 49 | 8 | 25.2 | 4 |
| 9 | 50 | 9 | 31.1 | 10 |
| 10 | 53 | 10.5 | 34.7 | 16 |
| 11 | 53 | 10.5 | 42.0 | 18 |
| 12 | 54 | 12 | 29.1 | 8 |
| 13 | 56 | 13 | 32.5 | 12 |
| 14 | 57 | 14 | 30.3 | 9 |
| 15 | 58 | 15.5 | 33.0 | 13 |
| 16 | 58 | 15.5 | 33.8 | 14 |
| 17 | 60 | 17 | 41.1 | 17 |
| 18 | 61 | 18 | 34.5 | 15 |

当然,这种接近并非总是如此。当散点图中的数据偏离椭圆形状时,两种方法往往会有所不同。由于这表明不适合计算皮尔逊 $r$,因此当 $r$ 与 $r_s$ 明显不同时,应使用 $r_s$。实际上,我们不会同时计算 $r$ 和 $r_s$,而是根据散点图的形态选择方法。秩相关可以用于任何类型的分布模式,其优点在于它评估的不是特定的线性关联,而是更一般的关联。例如,$r_s$ 的值在对任一变量进行对数变换后保持不变。然而,我之前对相关分析的所有警告同样适用于秩相关。
This is not always the case, of course. The two methods will tend to differ when the data deviate from an elliptical shape in the scatter diagram. As this is an indication against the calculation of Pearson's $r$, it follows that when $r$ and $r_s$ differ noticeably it is $r_s$ that should be used. In practice we do not calculate both $r$ and $r_s$, but choose the method according to the appearance of the scatter diagram. Rank correlation may be used whatever type of pattern is seen and it has the advantage of not specifically assessing linear association but more general association. This may be seen, for example, from the fact that the value of $r_s$ is unchanged by logarithmic transformation of either of the variables. All my earlier cautions against the use of correlation apply equally to rank correlation, however.
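As an illustration of these points, the sketch below computes Pearson's $r$ and Spearman's $r_s$ for the age and % fat values of Table 11.1, obtaining $r_s$ as Pearson's $r$ on average ranks, and then repeats the rank correlation after a logarithmic transformation of % fat. The helper names (`pearson`, `average_ranks`, `spearman`) are illustrative and not from the book.

```python
from math import log, sqrt

# Age and % fat for the 18 subjects of Table 11.1
age = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]


def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's r_s as Pearson's r calculated on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


r = pearson(age, fat)
r_s = spearman(age, fat)
r_s_log = spearman(age, [log(v) for v in fat])  # ranks are unchanged by log

print(f"Pearson r              = {r:.4f}")
print(f"Spearman r_s           = {r_s:.4f}")
print(f"Spearman r_s, log(fat) = {r_s_log:.4f}")
```

Because the logarithm preserves the ordering of the values, the ranks and hence $r_s$ are identical before and after the transformation, whereas Pearson's $r$ would generally change.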

Hughes和Jones(1985)研究了46个国家中膳食纤维平均摄入量与初潮平均年龄的关系。他们报告的相关系数为)。然而,如图11.6所示,数据倾向于聚集为两个主要群体,大致对应发达国家和发展中国家,且存在一个极端点。因此,两变量的数据均不接近正态分布。我们可能更倾向于使用秩相关,得到。我们可以将相同的值视为大致等价,因此根据表B7,较弱的秩相关依然高度显著()。
Hughes and Jones (1985) studied the relation between average intake of dietary fibre and the average age of menarche in 46 countries. They quoted a correlation coefficient of ( ). However, as Figure 11.6 shows, the data tend to cluster in two main groups, corresponding roughly to developed and developing countries, and there is one extreme point. The data are thus not near to a Normal distribution for either variable. We might, therefore, prefer to use rank correlation, which gives . We can interpret identical values of and as being roughly equivalent, so from Table B7 the rather weaker rank correlation is also highly significant ( ).


图11.6 46个国家膳食纤维平均每日摄入量与初潮平均年龄的关系(Hughes和Jones,1985)。
Figure 11.6 Relation between average daily intake of dietary fibre and mean age of menarche in 46 countries (Hughes and Jones, 1985).

我们可以用与 $r$ 完全相同的方法计算 $r_s$ 的置信区间。按照第11.7节的方法,纤维与初潮数据的95%置信区间为0.32至0.87。尽管 $P$ 值非常小,数据仍与总体相关系数的较宽范围的可能取值相容。
We can calculate a confidence interval for $r_s$ in exactly the same way as for $r$. Following the method given in section 11.7, the 95% confidence interval for the fibre and menarche data is from 0.32 to 0.87. The data are thus compatible with a wide range of possibilities for the population correlation, despite the very small $P$ value.

我在10.9.2节中提到,秩相关可以用来评估两个有序分类变量之间的关联程度。显然,在这种情况下会有大量的并列值,因此必须使用允许处理并列值的方法版本(参见11.7.2节)。
I mentioned in section 10.9.2 that rank correlation can be used to assess the degree of association between two ordered categorical variables. Clearly there will be huge numbers of ties in this situation so it is essential to use the version of the method that allows for them (see section 11.7.2).

应该更频繁地使用秩相关。它是唯一一种非参数方法,能提供与其参数方法等量的信息(而不仅仅是一个 $P$ 值),且适用范围更广。通过对数据进行排序并执行常规的Pearson相关分析,利用广泛可用的计算机程序,操作非常简便。
Rank correlation should be used more often. It is the only non-parametric method which gives as much information as its parametric equivalent (rather than just a $P$ value), and it is of wider validity. It is easy to perform using widely available computer programs by ranking the data and performing the usual Pearson correlation analysis.

11.5 调整相关系数以控制另一个变量 11.5 ADJUSTING A CORRELATION FOR ANOTHER VARIABLE

有时我们掌握了第三个变量的数据,这个变量可能影响了两个其他变量之间观察到的关系。我们可以通过计算偏相关系数来调整第三个变量的影响。偏相关系数可视为在第三个变量取值相同的个体(或国家等)中,两个变量之间的估计相关性。该方法适用于Pearson或Spearman相关系数。
Sometimes we have data on a third variable that might have influenced the observed relationship between two other variables. We can adjust for the third variable by calculating the partial correlation coefficient. We can consider this to be the estimated correlation between two variables among individuals (or countries or whatever) with the same value of the third variable. The same approach can be used for Pearson's or Spearman's correlation coefficient.

Begg和Hearns(1966)研究了血细胞比容(PCV)、纤维蛋白原及其他蛋白质(白蛋白和球蛋白)对血液粘度的相对贡献。表11.2展示了他们对32名住院患者的数据。四个变量之间的相关系数以相关矩阵形式列于表11.3。血液粘度与PCV的相关系数为0.88(),与纤维蛋白原的相关系数为0.46()。作者使用偏相关分析检验在控制PCV影响后,血液粘度与纤维蛋白原的关联是否仍然存在。偏相关系数为0.21(),提示血液粘度与纤维蛋白原的关联很大程度上可由PCV的变异解释。
Begg and Hearns (1966) were interested in the relative contributions of haematocrit (packed cell volume, PCV), fibrinogen and other proteins (albumin and globulin) to the viscosity of blood. Table 11.2 shows their data from 32 hospital patients. The correlation coefficients between the four variables are shown as a correlation matrix in Table 11.3. The correlation between blood viscosity and PCV was 0.88 and between blood viscosity and fibrinogen was 0.46 . The authors used partial correlation to see if the association of blood viscosity and fibrinogen remained after allowing for the association with PCV. The partial correlation is 0.21 , suggesting that the association between blood viscosity and fibrinogen can be largely explained by variation in PCV.

表11.2 32名住院患者的血液粘度、血细胞比容(PCV)、血浆纤维蛋白原及其他蛋白质数据(Begg和Hearns,1966)
Table 11.2 Data on blood viscosity, packed cell volume (PCV), plasma fibrinogen and other proteins from 32 hospital patients (Begg and Hearns, 1966)

| 患者 Patient | 血液粘度 Blood viscosity (cP) | PCV (%) | 血浆纤维蛋白原 Plasma fibrinogen (mg/100 ml) | 血浆蛋白 Plasma protein (g/100 ml) |
|---|---|---|---|---|
| 1 | 3.71 | 40 | 344 | 6.27 |
| 2 | 3.78 | 40 | 330 | 4.86 |
| 3 | 3.85 | 42.5 | 280 | 5.09 |
| 4 | 3.88 | 42 | 418 | 6.79 |
| 5 | 3.98 | 45 | 774 | 6.40 |
| 6 | 4.03 | 42 | 388 | 5.48 |
| 7 | 4.05 | 42.5 | 336 | 6.27 |
| 8 | 4.14 | 47 | 431 | 6.89 |
| 9 | 4.14 | 46.75 | 276 | 5.18 |
| 10 | 4.20 | 48 | 422 | 5.73 |
| 11 | 4.20 | 46 | 280 | 5.89 |
| 12 | 4.27 | 47 | 460 | 6.58 |
| 13 | 4.27 | 43.25 | 412 | 5.67 |
| 14 | 4.37 | 45 | 320 | 6.23 |
| 15 | 4.41 | 50 | 502 | 4.99 |
| 16 | 4.64 | 45 | 550 | 6.37 |
| 17 | 4.68 | 51.25 | 414 | 6.40 |
| 18 | 4.73 | 50.25 | 304 | 6.00 |
| 19 | 4.87 | 49 | 472 | 5.94 |
| 20 | 4.94 | 50 | 728 | 5.16 |
| 21 | 4.95 | 50 | 716 | 6.29 |
| 22 | 4.96 | 49 | 400 | 5.96 |
| 23 | 5.02 | 50.5 | 576 | 5.90 |
| 24 | 5.02 | 51.25 | 354 | 5.81 |
| 25 | 5.12 | 49.5 | 392 | 5.49 |
| 26 | 5.15 | 56 | 352 | 5.41 |
| 27 | 5.17 | 50 | 572 | 6.24 |
| 28 | 5.18 | 47 | 634 | 6.50 |
| 29 | 5.38 | 53.25 | 458 | 6.60 |
| 30 | 5.77 | 57 | 1070 | 4.82 |
| 31 | 5.90 | 54 | 488 | 5.70 |
| 32 | 5.90 | 54 | 488 | 5.70 |

表11.3 表11.2中数据的相关矩阵
Table 11.3 Correlation matrix of the data in Table 11.2

| | 粘度 Viscosity | 红细胞压积 PCV | 纤维蛋白原 Fibrinogen |
|---|---|---|---|
| 红细胞压积 PCV | 0.8788 | | |
| 纤维蛋白原 Fibrinogen | 0.4573 | 0.4155 | |
| 蛋白质 Protein | -0.1011 | -0.1575 | -0.0512 |

James(1985)提供了19个欧洲国家的异卵双胞胎率(DZ)和平均每日牛奶消费量与纬度的关系数据(见表11.4)。James特别关注DZ双胞胎率与纬度的关系,如图11.7所示。秩相关为0.68,极为显著。显然,三个变量的数值均呈同步上升趋势,因此我们可以探讨观察到的关联是否可以通过牛奶消费量的变异来“解释”(统计学意义上)。15个国家提供了人均牛奶消费量数据;对于这些国家,DZ双胞胎率与纬度的相关系数为0.61。牛奶消费与纬度的秩相关为0.92,牛奶消费与DZ双胞胎率的秩相关为0.61。我们可以计算在控制牛奶消费(M)后纬度(L)与DZ双胞胎率(T)之间的偏相关,结果为 $r_{LT \cdot M} = 0.16$(参见11.7.3节)。这个较小的值提示,DZ双胞胎率与纬度之间观察到的关联的一种可能解释是牛奶消费。然而,对这类国际相关性的解释尤其困难。注意,尽管国家从未像技术上要求的那样随机抽样,相关系数仍常用于此类数据。第11.8节讨论了解释相关系数时的一般问题。
James (1985) gave data on dizygotic (DZ) twinning rates and average daily milk consumption for 19 European countries in relation to latitude (see Table 11.4). James was especially interested in the relation between DZ twinning rate and latitude, shown in Figure 11.7. The rank correlation is 0.68, which is highly significant. It is clear that the values of all three variables tend to increase together so we might ask whether the observed association could be 'explained' (statistically) by variation in milk consumption. Information on per capita consumption of milk was available for 15 countries; for these the correlation between DZ twinning rate and latitude was 0.61. The rank correlation between milk consumption and latitude is 0.92 and between milk consumption and DZ twinning rate it is 0.61. We can calculate the partial correlation between latitude (L) and DZ twinning rate (T) adjusted for milk consumption (M) as $r_{LT \cdot M} = 0.16$ (see section 11.7.3). This small value suggests that one possible explanation for the observed association between DZ twinning and latitude might be milk consumption. Interpretation of such international correlations is particularly difficult, however. Note that correlation is frequently used for this type of data, although the countries are never randomly sampled as they technically should be. Section 11.8 discusses the general problems of interpreting correlation coefficients.

表11.4 纬度、年龄标准化的异卵双胞胎率及人均每日牛奶制品消费量(James, 1985)。括号内数字为秩
Table 11.4 Latitude, age-standardized dizygotic twinning rates and daily per capita consumption of milk products (James, 1985). Figures in brackets are ranks

| 国家 Country | 纬度 Latitude (L) | 异卵双胞胎率 DZ twinning rate/1000 (T) | 牛奶消费量 Milk consumption (M) |
|---|---|---|---|
| 葡萄牙 Portugal | 40 (1.5) | 6.5 (2) | 3.8 |
| 希腊 Greece | 40 (1.5) | 8.8 (13) | 7.7 |
| 西班牙 Spain | 41 (3) | 5.9 (1) | 8.2 |
| 保加利亚 Bulgaria | 42 (4) | 7.0 (3) | - |
| 意大利 Italy | 44 (5) | 8.6 (11.5) | 6.5 |
| 法国 France | 47 (6.5) | 7.1 (4) | 10.9 |
| 瑞士 Switzerland | 47 (6.5) | 8.1 (7.5) | - |
| 奥地利 Austria | 48 (8) | 7.5 (6) | 15.9 |
| 比利时 Belgium | 51 (9.5) | 7.3 (5) | 11.6 |
| 西德 FR Germany | 51 (9.5) | 8.2 (9) | 14.1 |
| 荷兰 Holland | 52 (11.5) | 8.1 (7.5) | 18.9 |
| 东德 GDR | 52 (11.5) | 9.1 (16) | - |
| 英格兰和威尔士 England and Wales | 53 (13.5) | 8.9 (14.5) | 17.1 |
| 爱尔兰 Eire | 53 (13.5) | 11.0 (18) | 24.4 |
| 苏格兰 Scotland | 56 (15.5) | 8.9 (14.5) | - |
| 丹麦 Denmark | 56 (15.5) | 9.6 (17) | 16.8 |
| 瑞典 Sweden | 60 (17) | 8.6 (11.5) | 20.9 |
| 挪威 Norway | 61 (18) | 8.3 (10) | 19.3 |
| 芬兰 Finland | 62 (19) | 12.1 (19) | 30.4 |

图11.7 19个欧洲国家纬度与异卵双胞胎率的关系(James, 1985)
Figure 11.7 Relation between latitude and dizygotic twinning rate in 19 European countries (James, 1985).

部分相关在医学研究中使用不多。三个或更多变量之间的关系通常用更具信息量的多元回归分析,详见第12.4节。但部分相关的计算方法将在第11.7.3节说明。
Partial correlation is not used a great deal in medical studies. The relation between three or more variables is usually investigated using the more informative multiple regression, which will be described in section 12.4. However, the method of calculating the partial correlation is explained in section 11.7.3.

11.6 使用相关系数评估非正态性 11.6 USE OF THE CORRELATION COEFFICIENT IN ASSESSING NON-NORMALITY

在第7.5.2节,我描述了如何利用正态概率图对样本观测值是否符合正态分布进行视觉评估。我介绍了Shapiro-Wilk $W$ 检验,但大多数统计软件不支持,且手工计算过于困难。一个简单得多的替代方法是使用类似的Shapiro-Francia $W'$ 检验(Royston, 1983)。
In section 7.5.2 I described the use of the Normal plot to get a visual assessment of how compatible a sample of observations is with having been drawn from a population with a Normal distribution. I described the use of the Shapiro-Wilk $W$ test, but this is not available in most statistical packages and is too difficult to perform by hand. A much simpler alternative is to use the similar Shapiro-Francia $W'$ test (Royston, 1983).

相关系数评估两个变量值之间的线性关联程度。因此,它可用于评估正态概率图的线性程度,从而判断数据是否符合正态性原假设。正态概率图是观察数据与正态分数(见第7.5.4节)的散点图,因此我们需要计算这两个量之间的Pearson相关系数。
The correlation coefficient assesses the degree of straight-line association between the values of two variables. It can thus be used to assess the straightness of a Normal plot, and so whether the data are compatible with the null hypothesis of Normality. The Normal plot is a plot of the observed data against the Normal scores (see section 7.5.4), so we need the Pearson correlation coefficient between these two quantities.

我们不能使用通常的表格来评估该相关系数,因为这里的原假设是相关系数为1,而非0。更容易考虑的是该相关系数的平方,它被称为 $W'$。表B12展示了如何评估观察到的 $W'$ 值。
We cannot use the usual tables for assessing this correlation coefficient, because the null hypothesis here is that the correlation is 1, not 0. It is easier to consider the square of this correlation, and it is the squared correlation which is termed $W'$. Table B12 shows how to assess an observed value of $W'$.

表11.5显示了一些血糖数据,将在本章后面使用,数据按升序排列。表中还显示了使用第7.5.2节公式计算的期望累计频率 $P_i$ 及对应的正态分数。原始数据与正态分数的相关系数为0.9772,因此 $W'$ 的值为 $0.9772^2 = 0.955$。根据表B12,该值不具有统计学显著性,说明数据与来自正态总体的样本相符。
Table 11.5 shows some blood glucose data that will be used later in this chapter, sorted into ascending order. Also shown are the expected cumulative frequencies $P_i$, using the formula in section 7.5.2, and the corresponding Normal scores. The correlation coefficient between the raw data and the Normal scores is 0.9772, so the value of $W'$ is $0.9772^2 = 0.955$. From Table B12 this value is not statistically significant, so that the data are compatible with being a sample from a Normal population.

表11.5 24名1型糖尿病患者的空腹血糖数据(Thuesen等,1985),及正态分数的计算
Table 11.5 Fasting blood glucose data from 24 type 1 diabetic patients (Thuesen et al., 1985), with calculation of Normal scores

| 患者 Patient ($i$) | 血糖值 Blood glucose | $P_i$ | 正态分数 Normal score |
|---|---|---|---|
| 1 | 4.2 | 0.026 | -1.947 |
| 2 | 4.9 | 0.067 | -1.498 |
| 3 | 5.2 | 0.108 | -1.236 |
| 4 | 5.3 | 0.149 | -1.039 |
| 5 | 6.7 | 0.191 | -0.875 |
| 6 | 6.7 | 0.232 | -0.732 |
| 7 | 7.2 | 0.273 | -0.603 |
| 8 | 7.5 | 0.314 | -0.483 |
| 9 | 8.1 | 0.356 | -0.370 |
| 10 | 8.6 | 0.397 | -0.261 |
| 11 | 8.8 | 0.438 | -0.156 |
| 12 | 9.3 | 0.479 | -0.052 |
| 13 | 9.5 | 0.521 | 0.052 |
| 14 | 10.3 | 0.562 | 0.156 |
| 15 | 10.8 | 0.603 | 0.261 |
| 16 | 11.1 | 0.644 | 0.370 |
| 17 | 12.2 | 0.686 | 0.483 |
| 18 | 12.5 | 0.727 | 0.603 |
| 19 | 13.3 | 0.768 | 0.732 |
| 20 | 15.1 | 0.809 | 0.875 |
| 21 | 15.3 | 0.851 | 1.039 |
| 22 | 16.1 | 0.892 | 1.236 |
| 23 | 19.0 | 0.933 | 1.498 |
| 24 | 19.5 | 0.974 | 1.947 |
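The calculation behind Table 11.5 can be sketched as follows. This is an illustration only: the formula $P_i = (i - 3/8)/(n + 1/4)$ is an assumption inferred from the tabulated values (it reproduces the $P_i$ column), since the formula of section 7.5.2 is not repeated here, and the Normal scores are taken as the standard Normal quantiles of $P_i$. The helper name `pearson` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Fasting blood glucose values from Table 11.5, already in ascending order
glucose = [4.2, 4.9, 5.2, 5.3, 6.7, 6.7, 7.2, 7.5, 8.1, 8.6, 8.8, 9.3,
           9.5, 10.3, 10.8, 11.1, 12.2, 12.5, 13.3, 15.1, 15.3, 16.1, 19.0, 19.5]


def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


n = len(glucose)

# Assumed formula for the expected cumulative frequencies (matches Table 11.5)
p = [(i - 3 / 8) / (n + 1 / 4) for i in range(1, n + 1)]

# Normal scores: standard Normal deviates corresponding to each P_i
scores = [NormalDist().inv_cdf(p_i) for p_i in p]

r_normal_plot = pearson(glucose, scores)  # straightness of the Normal plot
w_prime = r_normal_plot ** 2              # Shapiro-Francia W'

print(f"P_1 = {p[0]:.3f}, first Normal score = {scores[0]:.3f}")
print(f"correlation = {r_normal_plot:.4f}, W' = {w_prime:.4f}")
```

The value of $W'$ obtained this way still has to be assessed against a table such as Table B12, because the usual critical values for a correlation coefficient test a null value of 0 rather than 1.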

11.7 相关性—数学及实例解析 11.7 CORRELATION - MATHEMATICS AND WORKED EXAMPLES

(本节给出本章前半部分所述计算的数学公式及实例演示,可跳过而不影响整体连贯性。)
(This section gives the mathematical formulae for the calculations described in the first part of this chapter, together with a worked example. It can be omitted without loss of continuity.)

11.7.1 Pearson相关系数 $r$ 11.7.1 Pearson's $r$

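通常计算的相关系数称为Pearson $r$ 或“积差”相关系数。若有两个变量 $X$ 和 $Y$,它们之间的相关性记为 $r_{XY}$,通常简写为 $r$,计算公式为
The correlation coefficient that is usually calculated is called Pearson's $r$ or the 'product-moment' correlation coefficient. If we have two variables $X$ and $Y$, the correlation between them, denoted by $r_{XY}$ or usually just $r$, is given by

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

其中 $x_i$ 和 $y_i$ 是第 $i$ 个个体的 $X$ 和 $Y$ 的取值。宽泛地说,$r$ 的值可以看作是数据大致落入的椭圆的拉长程度的度量。该公式显然是对称的,即哪个变量是 $X$、哪个是 $Y$ 并不影响结果。
where $x_i$ and $y_i$ are the values of $X$ and $Y$ for the $i$th individual. The value of $r$ may loosely speaking be seen as a measure of the elongation of the ellipse that the data approximately fall within. The equation is clearly symmetric in that it does not matter which variable is $X$ and which is $Y$.

为了便于计算,可以使用一个更简单的公式
For the purposes of calculation a simpler formula to use is

$$r = \frac{\sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i}{\sqrt{\left(\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2\right)\left(\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2\right)}}$$

其中需要计算 $\sum x_i$、$\sum y_i$、$\sum x_i^2$、$\sum y_i^2$ 和 $\sum x_i y_i$。
for which it is necessary to obtain $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$ and $\sum x_i y_i$.

如果你已经有了均值($\bar{x}$ 和 $\bar{y}$)以及标准差($s_x$ 和 $s_y$),公式可以简化为
If you already have the means ($\bar{x}$ and $\bar{y}$) and standard deviations ($s_x$ and $s_y$) the formula simplifies to

$$r = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{(n - 1)\, s_x s_y}$$

这样就只需要额外计算 $\sum x_i y_i$ 这一项。
so that it is only necessary to calculate the extra term $\sum x_i y_i$.

然而,这个公式不应在计算机程序中使用,因为偶尔会因舍入误差引入不准确。(应使用第一个 $r$ 的公式。)
This formula should not be used in a computer program, however, as inaccuracy is occasionally introduced through rounding errors. (The first equation for $r$ should be used.)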

(a) 置信区间 (a) Confidence interval

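Pearson 相关系数 $r$ 的抽样分布不是正态分布,但我们可以对 $r$ 进行变换,得到一个称为 $z$ 的量,它具有正态抽样分布。变换为
The sampling distribution of Pearson's $r$ is not Normal, but we can transform $r$ to get a quantity called $z$ which does have a Normal sampling distribution. The transformation is

$$z = \frac{1}{2}\log_e\left(\frac{1 + r}{1 - r}\right)$$

$z$ 的标准误约为 $1/\sqrt{n - 3}$,其中 $n$ 是样本量,因此我们可以构建 $z$ 的 95% 置信区间,区间范围为
The standard error of $z$ is approximately $1/\sqrt{n - 3}$ where $n$ is the sample size, so we can construct a 95% confidence interval for $z$ as being from

$$z_1 = z - \frac{1.96}{\sqrt{n - 3}} \quad \text{to} \quad z_2 = z + \frac{1.96}{\sqrt{n - 3}}$$

我们将上述值反变换,以获得总体相关系数的置信区间,表达式为
We back-transform the above values to get a confidence interval for the population correlation coefficient as

$$\frac{e^{2z_1} - 1}{e^{2z_1} + 1} \quad \text{to} \quad \frac{e^{2z_2} - 1}{e^{2z_2} + 1}$$

图 11.1 中的脂肪百分比与年龄数据的相关系数为 0.7921,因此我们有
The % fat and age data in Figure 11.1 had a correlation of 0.7921 so we have

$$z = \frac{1}{2}\log_e\left(\frac{1 + 0.7921}{1 - 0.7921}\right) = 1.0770$$

我们可以通过计算获得 $z$ 的 95% 置信区间
We can get a 95% confidence interval for $z$ by calculating

$$z_1 = 1.0770 - \frac{1.96}{\sqrt{15}} = 0.5710$$

以及
and

$$z_2 = 1.0770 + \frac{1.96}{\sqrt{15}} = 1.5831$$

得到区间为 0.5710 到 1.5831。我们将这些值反变换,以获得 $r$ 的 95% 置信区间,
giving 0.5710 to 1.5831. We back-transform these values to get a 95% confidence interval for $r$ as

$$\frac{e^{2 \times 0.5710} - 1}{e^{2 \times 0.5710} + 1} \quad \text{to} \quad \frac{e^{2 \times 1.5831} - 1}{e^{2 \times 1.5831} + 1}$$

即 0.52 到 0.92。虽然整个置信区间远大于零,但区间非常宽。
or 0.52 to 0.92. Although the whole confidence interval is much greater than zero, it is very wide.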
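The steps of this worked example can be sketched in a few lines. The function name `fisher_ci` and its return layout are illustrative choices, not part of the book; the arithmetic follows the transformation and standard error given above, with $r = 0.7921$ and $n = 18$.

```python
from math import exp, log, sqrt


def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a correlation via the z transformation.

    Returns z, the interval (z1, z2) on the z scale, and the
    back-transformed interval (lower, upper) on the r scale.
    """
    z = 0.5 * log((1 + r) / (1 - r))
    half_width = z_crit / sqrt(n - 3)  # z_crit times the standard error of z
    z1, z2 = z - half_width, z + half_width

    def back_transform(value):
        return (exp(2 * value) - 1) / (exp(2 * value) + 1)

    return z, (z1, z2), (back_transform(z1), back_transform(z2))


z, (z1, z2), (lower, upper) = fisher_ci(0.7921, 18)

print(f"z = {z:.4f}")
print(f"95% CI for z: {z1:.4f} to {z2:.4f}")
print(f"95% CI for r: {lower:.2f} to {upper:.2f}")
```

The same function could be applied to a rank correlation $r_s$ when the sample is larger than about 10, as noted in section 11.7.2.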

(b) 假设检验 (b) Hypothesis test

相关系数的假设检验可以非常容易地进行。在总体中无关联(即相关系数为零)的原假设下,可以证明以下量
The hypothesis test for the correlation coefficient may be performed very easily. Under the null hypothesis that there is no association in the population (i.e. zero correlation) it can be shown that the quantity

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

服从自由度为 $n - 2$ 的 $t$ 分布。因此,可以通过查阅 $t$ 分布表(表 B4)来检验无关联的原假设。
has a $t$ distribution with $n - 2$ degrees of freedom. Thus the null hypothesis of no association may be tested by looking this value up in the table of the $t$ distribution (Table B4).

图 11.1 中脂肪百分比与年龄的数据相关系数为 0.7921,因此我们有
The % fat and age data in Figure 11.1 had a correlation of 0.7921 so we have

$$t = 0.7921\sqrt{\frac{18 - 2}{1 - 0.7921^2}} = 5.19$$

自由度为16。
on 16 degrees of freedom.

然而,表B7显示了 $r$ 本身的临界值,这更易于使用。对于大多数实际目的,这个表格就足够了。
However, Table B7 shows critical values for $r$ itself, and this is much easier to use. This table will prove sufficient for most practical purposes.
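The test statistic in this example can be reproduced directly. The sketch below uses an illustrative function name, `correlation_t`, and simply evaluates $t = r\sqrt{(n - 2)/(1 - r^2)}$ for $r = 0.7921$ and $n = 18$; looking the value up in a table of the $t$ distribution is left to the reader.

```python
from math import sqrt


def correlation_t(r, n):
    """t statistic for testing the null hypothesis of zero correlation."""
    return r * sqrt((n - 2) / (1 - r ** 2))


n = 18
t_stat = correlation_t(0.7921, n)
df = n - 2  # degrees of freedom for the t distribution

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```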

11.7.2 秩相关 11.7.2 Rank correlation

斯皮尔曼秩相关系数通过将两个变量的数值分别按顺序排秩得到。表11.4中展示了一个例子。计算 $r_s$ 最简单的方法是对数据的秩计算Pearson $r$。对于表11.4中DZ双胞胎率与纬度的数据,这给出了 $r_s = 0.68$。
Spearman's rank correlation coefficient is obtained by ranking in order the values of each of the two variables. An example is shown in Table 11.4. The simplest way to get $r_s$ is to calculate Pearson's $r$ on the ranks of the data. For the data on DZ twinning rate and latitude in Table 11.4 this gives $r_s = 0.68$.

还有一种更适合手工计算的替代方法,但它假设数据中没有并列的秩。对所研究的 $n$ 个个体中的每一个,计算秩的差值 $d_i$。斯皮尔曼秩相关系数由下式给出:
There is an alternative approach which is simpler for hand calculation, but it assumes that there are no ties in the data. For each of the $n$ subjects being studied the difference in the ranks, $d_i$, is calculated. Spearman's rank correlation coefficient is then given by

$$r_s = 1 - \frac{6\sum d_i^2}{n^3 - n}$$

这个公式与Pearson $r$ 的公式没有明显相似之处,但在没有并列秩时,结果完全相同。
This formula bears no obvious similarity to the formula for Pearson's $r$ but gives the identical answer when there are no ties.

表11.4显示了纬度和DZ双胞胎率数据的秩。秩差的平方和为366.5,因此我们有
The ranks of the data on latitude and DZ twinning rate are shown in Table 11.4. The sum of the squares of the differences in the ranks is 366.5 so we have

$$r_s = 1 - \frac{6 \times 366.5}{19^3 - 19} = 1 - \frac{2199}{6840} = 0.68$$

虽然在数据中存在并列秩时,$r_s$ 的计算应作调整,但除非并列秩数量较多,否则影响很小。表11.4中的纬度和DZ双胞胎数据有若干并列秩,但无论是否做修正,$r_s$ 的值均为0.68(保留两位小数)。使用基于秩的Pearson相关系数的优点是自动处理了并列秩。此外,当然,使用标准统计软件计算也非常方便。
Although the calculation of $r_s$ should be modified when there are tied ranks in the data, the effect is small unless there are considerable numbers of ties. The latitude and DZ twinning data in Table 11.4 have several tied ranks but the value of $r_s$ is 0.68, to two decimal places, whether the correction is made or not. The advantage of the use of the Pearson correlation coefficient calculated on the ranks is that ties are automatically dealt with. Also, of course, it is easy to perform with standard statistical software.
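To illustrate the two routes to $r_s$, the sketch below takes the ranks printed in brackets in Table 11.4 for the 19 countries and computes both the hand-calculation formula based on $\sum d_i^2$ and Pearson's $r$ on the ranks. The variable and function names are illustrative only.

```python
from math import sqrt

# Ranks of latitude (L) and DZ twinning rate (T) as printed in Table 11.4
rank_latitude = [1.5, 1.5, 3, 4, 5, 6.5, 6.5, 8, 9.5, 9.5,
                 11.5, 11.5, 13.5, 13.5, 15.5, 15.5, 17, 18, 19]
rank_twinning = [2, 13, 1, 3, 11.5, 4, 7.5, 6, 5, 9,
                 7.5, 16, 14.5, 18, 14.5, 17, 11.5, 10, 19]


def pearson(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


n = len(rank_latitude)

# Hand-calculation route: ignores the ties in the data
sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_latitude, rank_twinning))
r_s_shortcut = 1 - 6 * sum_d_squared / (n ** 3 - n)

# Pearson's r on the ranks: ties are handled automatically
r_s_on_ranks = pearson(rank_latitude, rank_twinning)

print(f"sum of squared rank differences = {sum_d_squared}")
print(f"r_s from the d^2 formula        = {r_s_shortcut:.4f}")
print(f"r_s as Pearson's r on the ranks = {r_s_on_ranks:.4f}")
```

The two values differ slightly in the third decimal place because of the tied ranks, but both round to 0.68, which is the point made in the text.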

(a)置信区间 (a) Confidence interval

对于样本量大于约10的情况,$r_s$ 的分布与 $r$ 的分布相似,因此可以使用上述针对 $r$ 的方法来获得 $r_s$ 的置信区间。
The distribution of $r_s$ is similar to that of $r$ for samples larger than about 10, so a confidence interval for $r_s$ can be obtained using the method given above for $r$.

(b)假设检验 (b) Hypothesis test

在总体中无关联(即零相关)的原假设下,可以证明对于大样本,量
Under the null hypothesis that there is no association in the population (i.e. zero correlation) it can be shown that for large samples the quantity

$$t_s = r_s\sqrt{\frac{n - 2}{1 - r_s^2}}$$

服从自由度为 $n - 2$ 的 $t$ 分布。因此,可以通过查阅 $t$ 分布表(表 B4)来检验无关联的原假设。同样,也可以将 $r_s$ 与表 B7 中的临界值进行比较。对于较小的样本,应使用表 B8。
has a $t$ distribution with $n - 2$ degrees of freedom. Thus the null hypothesis of no association may be tested by looking this value up in the table of the $t$ distribution (Table B4). Equivalently, $r_s$ can be compared with the critical values in Table B7. For smaller samples Table B8 should be used.

11.7.3 偏相关 11.7.3 Partial correlation

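如果我们知道每对变量之间的相关系数,比如 $r_{XY}$、$r_{XZ}$ 和 $r_{YZ}$,就可以计算调整第三个变量后两个变量之间的相关性。为了就变量 $Z$ 的可能影响调整变量 $X$ 与 $Y$ 之间的相关性,我们计算调整 $Z$ 后 $X$ 与 $Y$ 的偏相关系数为
We can calculate the correlation between two variables after adjusting for a third if we have the correlation coefficients between each pair of variables, say $r_{XY}$, $r_{XZ}$ and $r_{YZ}$. To adjust the correlation between variables $X$ and $Y$ for the possible effect of variable $Z$ we calculate the partial correlation of $X$ and $Y$ adjusted for $Z$ as

$$r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$

类似地,偏秩相关的计算方法相同,只需将公式中的每个相关系数替换为相应的秩相关系数 $r_s$。
Similarly the partial rank correlation is calculated in the same way, with each correlation in the formula replaced by the corresponding rank correlation $r_s$.

偏相关系数的假设检验与普通相关系数的检验方法相同,只是自由度为 $n - 3$。表11.4中各对变量之间的相关系数(排除四个无牛奶消费数据的国家)为
The hypothesis test for the partial correlation coefficient is performed in the same way as for the ordinary correlation coefficient, except that there are $n - 3$ degrees of freedom. The correlations between pairs of variables in Table 11.4, omitting the four countries without milk consumption rates, were

$$r_{LT} = 0.61, \quad r_{LM} = 0.92, \quad r_{TM} = 0.61$$

因此,经调整牛奶消费后的纬度与DZ双胞胎率之间的偏秩相关系数为
so that the partial rank correlation coefficient between latitude and DZ twinning rate adjusted for milk consumption is

$$r_{LT \cdot M} = \frac{0.61 - 0.92 \times 0.61}{\sqrt{(1 - 0.92^2)(1 - 0.61^2)}} = 0.16$$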
11.8 相关性的解释 11.8 INTERPRETATION OF CORRELATION

相关系数的取值范围在 $-1$ 到 $+1$ 之间,中点零表示两个变量之间没有线性关联。然而,相关系数很小并不一定意味着两个变量不相关。为了确认这一点,我们应当研究数据的散点图,因为两个变量可能表现出特殊的(即非线性)关系。例如,我们不会观察到平均正午温度与日历月份之间有明显的相关性,因为它们存在周期性模式。更常见的情况是两个变量之间呈曲线关系,例如出生体重与妊娠期长度之间的关系。在这种情况下,Pearson的 $r$ 会低估这种关联,因为它衡量的是线性关联。秩相关系数在这里更合适,因为它以更一般的方式评估变量是否倾向于同时上升(或朝相反方向移动)。
Correlation coefficients lie within the range $-1$ to $+1$, with the midpoint of zero indicating no linear association between the two variables. A very small correlation does not necessarily indicate that two variables are not associated, however. To be sure of this we should study a plot of the data, because it is possible that the two variables display a peculiar (i.e. non-linear) relationship. For example, we would not observe much, if any, correlation between the average midday temperature and calendar month because there is a cyclic pattern. More common is the situation of a curved relationship between two variables, such as between birthweight and length of gestation. In this case Pearson's $r$ will underestimate the association as it is a measure of linear association. The rank correlation coefficient is better here as it assesses in a more general way whether the variables tend to rise together (or move in opposite directions).

It is surprising how unimpressive a correlation of 0.5 or even 0.7 is (Figure 11.2). As Table B7 shows, correlations of this magnitude are significant at the 5% level in samples as small as 16 and 9 respectively. Whether they are important is quite another matter. Feinstein (1985) commented on the lack of clinical relevance of a statistically significant correlation of less than 0.1 found in a sample of over 6000. The problem of clinical relevance is one that must be judged on its merits in each case, and depends on the context. For example, the same small correlation may be important in an epidemiological study but unimportant clinically.

One way of looking at the correlation that helps to modify over-enthusiasm is to calculate $100r^2$, which is the percentage of the variability of the data that is 'explained' by the association between the two variables. So a correlation of 0.7 implies that just about half (49%) of the variability may be put down to the observed association, and so on. This concept ties in with the analysis of variance for regression, discussed in section 11.13.6 and in Chapter 12. It may also be useful to calculate a confidence interval for the correlation coefficient, which for small samples will be wide.
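Both ideas in this paragraph are easy to check numerically. The sketch below computes the percentage of variability explained, and a confidence interval for a correlation using Fisher's $z$ transformation. The choice of that transformation, the function names and the example values ($r = 0.7$, $n = 9$) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def percent_explained(r: float) -> float:
    """Percentage of variability 'explained' by the association."""
    return 100 * r ** 2

def correlation_ci(r: float, n: int, level: float = 0.95):
    """Approximate confidence interval for a correlation (Fisher's z)."""
    z = math.atanh(r)                      # transform r to the z scale
    se = 1 / math.sqrt(n - 3)              # standard error on the z scale
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

print(round(percent_explained(0.7)))       # 49
low, high = correlation_ci(0.7, n=9)
print(round(low, 2), round(high, 2))
```

With only nine observations the interval runs from just above zero to above 0.9, which shows how wide such intervals are in small samples.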

Interpretation of association is often problematic because causation cannot be directly inferred. If we observe an association between two variables A and B there are several possible explanations. Excluding the possibility that it is a chance finding, it may be because

  1. A influences (or 'causes') B;
  2. B influences A; or
  3. both A and B are influenced by one or more other variables.

Where data are available for some suspected common cause C, it is possible to see if the observed association between A and B remains when allowing for C by calculating the partial correlation. With this exception, it is not in general possible to distinguish statistically between the three possibilities above, and inferences must be based on other knowledge. When looking at variables where there is no background knowledge, inferring a causal link is not justified. This applies regardless of the strength of the observed association.

For example, we are not surprised to see data from many countries that show a relation between consumption of alcohol and deaths from liver cirrhosis (Smith, 1981), because of the large body of scientific knowledge about the effect of alcohol on the liver. But what should we make of international data showing a relationship between pork consumption and cirrhosis mortality? Nanji and French (1985) reported such a correlation for 16 countries and a rank correlation of 0.60 for 10 Canadian provinces. In the absence of any scientific reason for such an association one should be sceptical about such a finding. Wherever possible one should try to examine the same variables in a different population. Seely (1985) studied the relation between pork consumption and cirrhosis mortality in 21 European countries, for which the rank correlation was 0.001; there was no association at all.

Interpretation of international correlations is particularly difficult because there are so many differences between countries. We could not safely interpret the data of Figure 11.6 as indicating that a high fibre diet leads to delayed menarche (and certainly not the converse). Other situations are not really any different. We ought not to take any correlation as indicating a causal association without collateral evidence, however 'reasonable' the hypothesis may be.

Correlation is often used as an exploratory method for investigating inter-relationships among many variables, for which purpose it is most obvious to use hypothesis tests. Although fine in principle, this approach is often overdone. The problem is that even with a modest number of variables the number of correlation coefficients is large: 10 variables yield 45 values, and 20 variables yield 190. If none of the variables were truly associated, about one in 20 of these would still be significant at the 5% level purely by chance, and these cannot be distinguished from genuine associations. Further, the magnitude of correlation that is significant at the 5% level depends upon the sample size. In a large sample, even if there are several significant values of around 0.2 to 0.3, say, these are unlikely to be very useful. While this way of looking at large numbers of variables can be helpful when one really has no prior hypotheses, significant associations really need to be confirmed in another set of data before credence can be given to them.
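The counts quoted above follow from the number of distinct pairs among $k$ variables, $k(k-1)/2$. As a small sketch (the function name is mine), the expected number of chance findings at the 5% level, if no variables were truly associated, can be tabulated alongside:

```python
from math import comb

def n_pairs(k: int) -> int:
    """Number of distinct pairwise correlations among k variables."""
    return comb(k, 2)

for k in (10, 20):
    pairs = n_pairs(k)
    # Expected number significant at 5% if every null hypothesis were true
    print(k, pairs, pairs * 0.05)
```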

Another common problem of interpretation occurs when we know that each of two variables is associated with a third variable. For example, if A is positively correlated with B and B is positively correlated with C it is tempting to infer that A must be positively correlated with C. Although this may indeed be true, such an inference is unjustified: we cannot say anything about the correlation between A and C. The same is true when one has observed no association. For example, in the data of Mazess et al. (1984) the correlation between age and weight was 0.05 and between weight and % fat it was 0.03 (Figure 11.3). This does not imply that the correlation between age and % fat was also near zero. In fact this correlation was 0.79, as we saw earlier (Figure 11.1). These three two-way relations are shown in Figure 11.8. Correlations cannot be inferred from indirect associations.

Correlation is often used when it would be better to use regression methods, discussed in section 11.10 onwards. The two methods are compared in section 11.17.


Figure 11.8 Scatter diagrams showing each two-way relation between age, % fat, and weight of 18 normal adults (Mazess et al., 1984).

11.9 PRESENTATION OF CORRELATION

Where possible it is useful to show a scatter diagram of the data. In such a graph it is often helpful to indicate different categories of observations by using different symbols, for example to indicate patients' sex.

The value of $r$ should be given to two decimal places, together with the $P$ value if a test of significance is performed. The number of observations should be stated.

If it is necessary to display the correlations between all pairs of a set of variables this can be done by means of a correlation matrix, as in Table 11.3. In this the correlation coefficients are shown in a triangular display similar to charts in road atlases showing the distances between each pair of towns. The graphical equivalent, shown in Figures 11.8 and 12.2, is even better.

11.10 REGRESSION

Other questions may arise when we have a set of data on two continuous variables. In particular we might wish to describe the relation between them, and thus be able to predict the value of one variable for an individual when we only know the other variable. Clearly the correlation coefficient does not perform these functions; it just indicates the strength of the association as a single number. We want a way of describing the relation between the values of the two variables, and for this general problem we need the technique called regression.

In this chapter I shall consider just the simple case where we have two variables; extensions are discussed in Chapters 12 and 14. I shall consider only the common case where we are interested in a linear (straight-line) relationship between two variables.

Table 11.6 and Figure 11.9 show data collected from 24 type 1 diabetic patients. The variables are fasting blood glucose (mmol/l) and mean velocity of circumferential shortening of the left ventricle (Vcf) derived from echocardiography. One patient's Vcf was not recorded. If we are interested in trying to predict Vcf from blood glucose, then, unlike the case for correlation, we do not have a symmetric relation between the two

Table 11.6 Data from 24 type 1 diabetic patients (Thuesen et al., 1985)

| Patient | Fasting blood glucose (mmol/l) | Mean circumferential shortening velocity, Vcf (%/sec) |
|---|---|---|
| 1 | 15.3 | 1.76 |
| 2 | 10.8 | 1.34 |
| 3 | 8.1 | 1.27 |
| 4 | 19.5 | 1.47 |
| 5 | 7.2 | 1.27 |
| 6 | 5.3 | 1.49 |
| 7 | 9.3 | 1.31 |
| 8 | 11.1 | 1.09 |
| 9 | 7.5 | 1.18 |
| 10 | 12.2 | 1.22 |
| 11 | 6.7 | 1.25 |
| 12 | 5.2 | 1.19 |
| 13 | 19.0 | 1.95 |
| 14 | 15.1 | 1.28 |
| 15 | 6.7 | 1.52 |
| 16 | 8.6 | - |
| 17 | 4.2 | 1.12 |
| 18 | 10.3 | 1.37 |
| 19 | 12.5 | 1.19 |
| 20 | 16.1 | 1.05 |
| 21 | 13.3 | 1.32 |
| 22 | 4.9 | 1.03 |
| 23 | 8.8 | 1.12 |
| 24 | 9.5 | 1.70 |


Figure 11.9 Relation between fasting blood glucose and mean velocity of circumferential shortening of the left ventricle (Vcf). Data from 23 type 1 diabetics (Thuesen et al., 1985).

variables. We may consider these as a response (or outcome) variable (Vcf) and a predictor variable (blood glucose). These are often called dependent and independent variables respectively, confusing names which indicate which variable is depending on the other. The response variable is always plotted on the vertical, or $y$, axis and the predictor variable on the horizontal, or $x$, axis, as illustrated in Figure 11.9.

The problem is to fit a straight line to the data that in some sense gives the 'best' prediction of $y$ for any value of $x$. Intuitively this will be a line that minimizes the distance between the data and the fitted line. There are several possible approaches to this problem, but the standard method is called least squares regression. When we use this method to fit a regression line we minimize the sum of the squares of the vertical distances of the observations from the line. Figure 11.10 shows the same data with the least squares regression line, together with the vertical distances from the line. Each distance is the difference for an individual between the observed value and the value given by the line, known as the fitted value. The technical term for this distance is a residual, a term I shall use from now on. Notice that this approach gives a solution that does not depend on the scaling of the graph. If we were to take distances perpendicular to the line (which is an alternative possibility) the solution would depend on the way the graph was drawn, which is clearly an undesirable feature.


Figure 11.10 Data of Figure 11.9 with regression line, showing differences between observed and fitted values.

The least squares method produces the line that minimizes the sum of the squares of the residuals, and so it also minimizes the variance of the residuals, which is just the sum of squares divided by the number of observations minus two. This variance, known as the residual variance, is a measure of the 'goodness-of-fit' of the line. The residual variance is very important when assessing the results of a regression analysis.

If we have observed values of two variables, $x$ (blood glucose) and $y$ (Vcf), we can perform a 'regression of $y$ on $x$' to derive a straight line that gives a 'fitted' estimated value of $y$ for any value of the variable $x$. The general equation of a regression line is

$$y = a + bx$$

Here $b$ is the slope of the line and $a$ is called the intercept because it is the fitted value of $y$ where the line crosses the $y$ axis, for which $x = 0$. In most medical applications the value of $a$ will have no practical meaning, as the $x$ variable cannot be anywhere near zero; examples are blood pressure and any measurements of body size.

In practice the calculation of $a$ and $b$ for a given set of data is easy (see section 11.13) although it is definitely preferable to use a computer to do the calculations. For the data on diabetics the equation of the regression line shown in Figure 11.10 is

$$\text{Vcf} = 1.10 + 0.0220 \times \text{blood glucose}$$

What does this equation tell us? For any value of blood glucose the estimate of Vcf derived from the regression equation is the predicted value of Vcf, but we need some measure of the uncertainty of such a prediction. More basically, we would usually wish to consider the possibility that the observed relation between the two variables in these subjects is just a chance finding, and to consider how well the line fits the data. All of these aspects can be studied in relation to the residuals introduced earlier.
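As a minimal sketch of the least squares fit, the following uses the 23 complete pairs from Table 11.6 (patient 16, whose Vcf was not recorded, is omitted). The function and variable names are my own; the slope is computed from deviations about the means.

```python
# Table 11.6, omitting patient 16 (Vcf not recorded)
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

def least_squares(x, y):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar          # the line passes through (xbar, ybar)
    return a, b

a, b = least_squares(glucose, vcf)
print(f"Vcf = {a:.2f} + {b:.4f} x glucose")
```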

11.10.1 Assumptions

Before we can consider the use of a regression analysis it is important to consider three assumptions that underlie the method:

  1. the values of the outcome variable $y$ (Vcf in our example) should have a Normal distribution for each value of the predictor variable $x$;
  2. the variability of $y$, as assessed by the variance or standard deviation, should be the same for each value of $x$;
  3. the relation between the two variables should be linear.

Unlike for correlation, it is not a requirement that both variables should be random variables: regression analysis is valid if the values of the predictor variable $x$ have been chosen by the experimenter, as is sometimes the case. Nor do the values of $x$ need to be approximately Normal.

We can usually get a reasonable visual impression of whether the data deviate considerably from the three conditions listed above from a scatter diagram. Fortunately it is possible to assess them in detail after fitting the regression line. Again the residuals contain the relevant information.

If the three above assumptions hold then the residuals should have a Normal distribution (with a mean of zero). If we plot the residuals against the $x$ values the points should be evenly scattered at all $x$ values. I recommend that this plot is produced routinely. Figure 11.11 shows three possibilities for a plot of the residuals where (a) the assumptions are met; (b) the residuals have increasing variability as $x$ increases; (c) there is a curved relation between the residuals and the $x$ values. Plot (b) suggests that the $y$ data might need log transformation, and plot (c) indicates a non-linear relation between $y$ and $x$ (see section 11.12.2). It can happen that different problems occur simultaneously, and that log (or some other) transformation of the $y$ variable will solve all the problems at once.

The assumption of Normality can be assessed formally by means of a Normal plot of the residuals (see section 7.5.3). Some computer programs incorporate this analysis.
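For readers whose software does not produce a Normal plot, the coordinates can be computed directly. The sketch below pairs the ordered residuals with approximate Normal scores and reports the squared correlation between them as a rough measure of straightness. The use of Blom's approximation for the scores is my assumption, and the result is closely related to, but not claimed to reproduce, any particular program's Shapiro-Francia statistic.

```python
from statistics import NormalDist, mean

# Table 11.6, omitting patient 16 (Vcf not recorded)
glucose = [15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3, 11.1, 7.5, 12.2, 6.7, 5.2,
           19.0, 15.1, 6.7, 4.2, 10.3, 12.5, 16.1, 13.3, 4.9, 8.8, 9.5]
vcf = [1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09, 1.18, 1.22, 1.25, 1.19,
       1.95, 1.28, 1.52, 1.12, 1.37, 1.19, 1.05, 1.32, 1.03, 1.12, 1.70]

n = len(glucose)
xbar, ybar = mean(glucose), mean(vcf)
b = (sum((x - xbar) * (y - ybar) for x, y in zip(glucose, vcf))
     / sum((x - xbar) ** 2 for x in glucose))
a = ybar - b * xbar

# Ordered residuals and approximate Normal scores (Blom's formula, assumed)
residuals = sorted(y - (a + b * x) for x, y in zip(glucose, vcf))
scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    ubar, vbar = mean(u), mean(v)
    suv = sum((p - ubar) * (q - vbar) for p, q in zip(u, v))
    suu = sum((p - ubar) ** 2 for p in u)
    svv = sum((q - vbar) ** 2 for q in v)
    return suv / (suu * svv) ** 0.5

straightness = correlation(residuals, scores) ** 2
print(f"squared correlation of the Normal plot: {straightness:.3f}")
```

Plotting `residuals` against `scores` gives the Normal plot itself; values of the squared correlation close to 1 correspond to a nearly straight plot.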

Figure 11.12 shows the residuals plotted against blood glucose, which looks satisfactorily like Figure 11.11(a). Figure 11.13 shows a Normal plot of residuals, which is reasonably straight. However, the Shapiro-Francia


Figure 11.12 Residuals from the regression line shown in Figure 11.10, plotted against blood glucose.


Figure 11.13 Normal plot of residuals shown in Figure 11.12.

test indicates some non-Normality. Figure 11.13 suggests that this is not a major problem, but if we were worried we could try log transformation of Vcf and repeat the test on the residuals from the regression analysis using log Vcf.

11.11 USE OF REGRESSION

The least squares regression line shown in Figure 11.10 has the equation

$$\text{Vcf} = 1.10 + 0.0220 \times \text{blood glucose}$$

From Figures 11.10 and 11.13 the assumptions of this analysis seem reasonable - the scatter around the regression line is fairly even and symmetric, a linear relation seems plausible, and the residuals have a distribution that is not too far from Normal.

When the assumptions hold, the regression line can be thought of as joining the mean values of $y$ for each value of $x$. Hence the regression line gives an estimate of the average Vcf for a given blood glucose level. The line fitted to the sample data is an estimate of the relation between these variables in the population, so we should consider the uncertainty of this estimated line. Figure 11.14 shows the regression line together with the 95% confidence interval for the line. We can consider this interval as including the true relation with 95% probability. Alternatively, for any value of blood glucose the confidence interval covers the range of values which we are 95% confident includes the true mean Vcf in the population. The confidence interval is narrowest at the mean blood glucose (10.4 mmol/l) and gets wider with increasing distance from the mean.

The slope is the parameter of main interest in a regression analysis, as it indicates the strength of the relationship between the two variables. The slope of the fitted line is 0.022, which means that we estimate an increase in Vcf of 0.022% per second for every increase of one unit (i.e. 1 mmol/l) in fasting blood glucose. We can calculate the standard error of the slope, which is 0.0105. The estimated slope, $b$, is treated in much the same way as the mean of a sample. We can calculate a confidence interval for the slope, and can test the hypothesis of a zero slope, that is, of no relationship between Vcf and blood glucose. These calculations, which are described in section 11.13, yield a 95% confidence interval for the slope from 0.000 to 0.044; the test statistic is $t = 2.10$, with $P = 0.048$. By conventional criteria the slope is just significantly different from zero. As usual the confidence interval is more informative, showing that the data are compatible with no relation between Vcf and blood glucose or with one that is twice as strong as the observed one.
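The figures quoted for the slope can be reproduced from three summary quantities. In this sketch the sums of squares and products about the means are values I computed from Table 11.6 (rounded), and the critical value of $t$ on 21 degrees of freedom is taken from a $t$ table; the variable names are mine.

```python
import math

n = 23                                     # patients with both measurements
sxx, sxy, syy = 429.704, 9.4374, 1.1934    # computed from Table 11.6 (rounded)
T_CRIT = 2.080                             # two-sided 5% point of t, 21 df (from a t table)

b = sxy / sxx                                            # slope
s_res = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))      # residual standard deviation
se_b = s_res / math.sqrt(sxx)                            # standard error of the slope
t = b / se_b                                             # test of zero slope
ci = (b - T_CRIT * se_b, b + T_CRIT * se_b)              # 95% confidence interval

print(f"b = {b:.4f}, se(b) = {se_b:.4f}, t = {t:.2f}")
print(f"95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```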

Implicit in this analysis is the consistency of the relationship, as indicated by the scatter of the observed data around the fitted line. The nearer the points are to the line the narrower will be the confidence interval for the line (Figure 11.14). With the present data there is considerable scatter, and this is more noticeable if we consider the prediction of Vcf for a new subject with a known fasting blood glucose level.

Figure 11.14 showed the 95% confidence interval for the mean Vcf for a given value of fasting blood glucose. We expect greater uncertainty when trying to predict Vcf for an individual, and Figure 11.15 shows that the 95% prediction interval is indeed much wider. For any value of blood glucose we would expect 95% of future subjects to have Vcf values between the values shown. There is thus a 95% probability of an individual's Vcf being within this interval, although our best estimate is given by the value on the regression line corresponding to their blood glucose level. The 95% prediction interval also widens with distance from the mean blood glucose level although this is not as easy to see. What is clear is that for a given blood glucose value there is enormous uncertainty attached to the estimated Vcf. A much tighter prediction interval is needed for such a relation to have any clinical value. Note that unlike the confidence interval for the regression line the prediction interval can be made only slightly narrower by increasing the sample size. This is because the prediction interval mainly reflects individual variability about the fitted line, which has nothing to do with sample size. Where the measurements are imprecise (such as blood pressure) the prediction interval can be


Figure 11.14 As Figure 11.10, but showing the 95% confidence interval for the regression line.


Figure 11.15 As Figure 11.10, but showing the 95% interval for predicting Vcf from blood glucose for an individual subject.

narrowed by taking the average for each individual of two (or more) readings.
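The contrast between the two intervals can be sketched using the standard formulae for simple linear regression. The summary values below are ones I computed from Table 11.6 (rounded), the critical value comes from a $t$ table, and the function name is hypothetical.

```python
import math

n, xbar, ybar = 23, 10.374, 1.3257         # computed from Table 11.6 (rounded)
sxx, sxy, syy = 429.704, 9.4374, 1.1934
T_CRIT = 2.080                             # two-sided 5% point of t, 21 df (from a t table)

b = sxy / sxx
a = ybar - b * xbar
s_res = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))

def intervals(x0: float):
    """Fitted Vcf at glucose x0, with 95% CI for the mean and 95% prediction interval."""
    fit = a + b * x0
    lev = 1 / n + (x0 - xbar) ** 2 / sxx               # grows with distance from the mean
    half_mean = T_CRIT * s_res * math.sqrt(lev)        # uncertainty of the line
    half_pred = T_CRIT * s_res * math.sqrt(1 + lev)    # adds individual variability
    return fit, (fit - half_mean, fit + half_mean), (fit - half_pred, fit + half_pred)

for x0 in (10.374, 19.0):
    fit, mean_ci, pred = intervals(x0)
    print(f"glucose {x0}: fit {fit:.2f}, mean CI {mean_ci[0]:.2f} to {mean_ci[1]:.2f}, "
          f"prediction {pred[0]:.2f} to {pred[1]:.2f}")
```

The extra 1 under the square root does not shrink as $n$ grows, which is why a larger sample narrows the prediction interval only slightly.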

The fitted regression line explains a proportion of the variability in the dependent variable $y$, and the residuals indicate the amount of unexplained variability. A regression analysis can thus be displayed as an analysis of variance table which is very similar to those shown in Chapter 9. Table 11.7 shows the analysis of variance table corresponding to the regression of Vcf on blood glucose. The derivation of this table is explained in section 11.13.6. Many software packages present the results in this way. This format for displaying the results of a regression analysis extends easily to complex models, as will be seen in Chapter 12. Two points should be noted. Firstly, as this is an alternative way of displaying the same analysis, the $P$ value is the same as that obtained for the slope. In fact, the $F$ statistic is the square of the $t$ statistic obtained earlier ($4.41 = 2.10^2$). Secondly, the residual mean square (0.0470) is the variance of the residuals, and thus the square of the residual standard deviation.

The residual standard deviation indicates the variation not explained by the regression line so it is a measure of the goodness-of-fit of the line in the units of measurement. A more general way of assessing goodness-of-fit is to consider the proportion of the total variation explained by the model. This is usually done by considering the sum of squares explained by the regression as a percentage of the total sum of squares. From Table 11.7 this value is $0.2073/1.1934 = 0.17$ or 17%. This statistic is called $R^2$, and

Table 11.7 Analysis of variance table corresponding to regression of Vcf on blood glucose

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression | 1 | 0.2073 | 0.2073 | 4.41 | 0.048 |
| Residual | 21 | 0.9861 | 0.0470 | | |
| Total | 22 | 1.1934 | | | |

is the square of the correlation coefficient between Vcf and blood glucose. The concept extends to more complex models, and will be discussed again in the next chapter. The low value of $R^2$ here indicates that despite the statistically significant slope the majority of the variability in Vcf is not explained by variation in blood glucose levels.
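The entries of Table 11.7 can be rebuilt from the same summary quantities. This is a sketch only: the sums of squares and products are values I computed from Table 11.6 (rounded), and the variable names are mine.

```python
n = 23
sxx, sxy, syy = 429.704, 9.4374, 1.1934    # computed from Table 11.6 (rounded)

ss_total = syy                   # total sum of squares, n - 1 df
ss_reg = sxy ** 2 / sxx          # explained by the regression, 1 df
ss_res = ss_total - ss_reg       # residual, n - 2 df
ms_res = ss_res / (n - 2)        # residual mean square (residual variance)
f_stat = ss_reg / ms_res         # regression mean square / residual mean square
r_squared = ss_reg / ss_total    # proportion of variation explained

print(f"Regression SS {ss_reg:.4f}, Residual SS {ss_res:.4f}, Total SS {ss_total:.4f}")
print(f"Residual MS {ms_res:.4f}, F = {f_stat:.2f}, R^2 = {r_squared:.2f}")
```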

Section 11.13 gives the mathematical formulae for all the calculations relating to regression.

11.12 EXTENSIONS

The previous sections have presented the simplest form of regression analysis, where we wish to describe the linear relation between two continuous variables measured in a single sample. Various extensions are possible, two of which are described below. They are both types of multiple regression, whereby we can examine the dependence of one outcome variable on two or more other variables simultaneously. Multiple regression is discussed at more length in Chapter 12.

11.12.1 Comparing groups

If we have data from two groups of subjects we can fit regression lines to each, and then compare the slopes of the two lines to see if they are reasonably similar. A confidence interval can be obtained for the difference or a significance test can be carried out. If the two lines can be considered to have the same slope, then it is possible to fit lines to the two sets of data that have the same slope (i.e. they are parallel). The vertical distance between the two lines is then the difference in the means of the $y$ variable in the two groups adjusted for any difference in the distribution of the $x$ variable. This analysis is known as analysis of covariance. Further details are given by Altman and Gardner (1989). Analysis of covariance, which is also discussed in section 12.4.1, can be extended to more than two groups of observations.

11.12.2 Non-linear relationships

Sometimes it can be clearly seen from a scatter plot that the relation between two variables is curved. There are several statistical models that can be used to cope with non-linearity. The simplest method, and the only one considered here, is known as polynomial regression.

Polynomial regression is a special case of multiple regression when we wish to describe (or 'model') the non-linear relation between an outcome variable and a single predictor variable. A linear relation between variables $y$ and $x$ leads to a regression equation of the type $y = a + bx$. This idea can be extended to a non-linear relation by means of the model $y = a + bx + cx^2$. This model, which is called a quadratic curve, considers the outcome variable $y$ to be dependent not just on the predictor variable $x$ but also on its square $x^2$. By this means we obtain a curved relation between $y$ and $x$, although (as explained above) this model is a special case of multiple regression, with both $x$ and $x^2$ as predictor variables.

The quadratic model describes a simple curve which rises and then falls (or vice versa) in a symmetric manner about its maximum (or minimum) value. Altman and Coles (1980) fitted such a model to data giving mean birthweight for different lengths of gestation. For example, for female first born babies their fitted model was a quadratic of the form

$$\text{Birthweight} = a + b \times \text{age} + c \times \text{age}^2$$

where 'age' is the gestational age in weeks. This curve is shown in Figure 11.16.
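To make concrete the point that a quadratic curve is fitted as a multiple regression on $x$ and $x^2$, here is a small sketch that solves the least squares normal equations directly. The function name and the data are hypothetical; the data lie exactly on a known quadratic so that the result can be checked, and they are not the birthweight data of Altman and Coles.

```python
def fit_quadratic(x, y):
    """Least squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Predictor columns: constant, x and x squared
    cols = [[1.0] * len(x), [float(v) for v in x], [float(v) ** 2 for v in x]]
    # Augmented 3x4 matrix [X'X | X'y]
    m = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols]
         + [sum(p * q for p, q in zip(ci, y))] for ci in cols]
    # Gaussian elimination with partial pivoting
    for k in range(3):
        piv = max(range(k, 3), key=lambda i: abs(m[i][k]))
        m[k], m[piv] = m[piv], m[k]
        for i in range(k + 1, 3):
            f = m[i][k] / m[k][k]
            m[i] = [u - f * v for u, v in zip(m[i], m[k])]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for k in (2, 1, 0):
        tail = sum(m[k][j] * coef[j] for j in range(k + 1, 3))
        coef[k] = (m[k][3] - tail) / m[k][k]
    return tuple(coef)

# Hypothetical data lying exactly on y = 1 + 2x - 0.5x^2
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * v - 0.5 * v ** 2 for v in x]
a, b, c = fit_quadratic(x, y)
print(round(a, 6), round(b, 6), round(c, 6))
```

In practice a regression program would be used; solving the normal equations by hand like this is adequate only for small, well-scaled problems.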

11.13 REGRESSION - MATHEMATICS AND WORKED EXAMPLE

(This section gives the mathematical formulae for the calculations described in sections 11.10 to 11.12 together with a worked example. It can be omitted without loss of continuity. These formulae can be complicated to use, so it is preferable to perform regression analysis using a computer program if possible.)

Regression analysis will be illustrated using the data from diabetics shown in Table 11.6. We want the regression line to allow prediction of Vcf (velocity of circumferential shortening) from blood glucose, so the $x$ (predictor) variable is blood glucose and the $y$ (outcome) variable is Vcf.

11.13.1 The regression line

The equation of the least squares linear regression line is $y = a + bx$ and estimates of $a$ and $b$ can be obtained easily. Denoting the observed data as $x_i$ and $y_i$ ($i = 1, \ldots, n$) it can be shown that the line must pass through the mean of the data $(\bar{x}, \bar{y})$. The estimated slope is given by

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Figure 11.16 Quadratic curve fitted to mean birth weight by gestational age (Altman and Coles, 1980).

Note that, as we should expect from the nature of the analysis, the equation is asymmetric, in contrast to that for the correlation coefficient $r$ given in section 11.7: it does matter which variable is $X$ and which is $Y$.

The calculations can be simplified if we first obtain the 'sum of squares' of the $x$ and $y$ values about their means, and the 'sum of products':

$$S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}).$$

The quantities $S_{xx}$ and $S_{yy}$ are just $n - 1$ times the variances of $X$ and $Y$. An easier way of calculating $b$ is as

$$b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}.$$

This formula should not be used in a computer program, however, as inaccuracy is occasionally introduced because of rounding errors. Only the first equation given above for $b$ should be used for this purpose.

Because we know that the regression line passes through the mean $(\bar{x}, \bar{y})$, we can estimate $a$ simply as

$$a = \bar{y} - b\bar{x}.$$

So for any value of $X$, say $x_0$, the fitted value of $Y$ predicted by the equation is

$$y_{\text{fit}} = a + b x_0.$$
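The slope and intercept calculations above can be sketched as follows. The function name and the four data points are hypothetical (they are not the diabetic data of Table 11.6), and the sketch uses the deviations-from-the-mean form of the slope, as recommended for computer use.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squares of x and sum of products, both about the means
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
a, b = least_squares_line(xs, ys)
y_fit_at_mean = a + b * 2.5   # fitted value at x_bar = 2.5
```

For these points the fitted value at the mean of $x$ coincides with the mean of $y$, as the text says it must.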

Note that all the results quoted below were obtained using full numeric accuracy, but intermediate calculations have been rounded to clarify the presentation.

For the data on diabetics the mean values of the two variables are and , and the other quantities we will need are

We estimate the slope as

The intercept is estimated as

11.13.2 Residual variation

The difference between an observed value $y_i$ and the fitted value $y_{\text{fit}} = a + b x_i$ is thus

$$e_i = y_i - (a + b x_i)$$

and the value $e_i$ is the residual for that individual. It is the sum of the squares of the residuals, $\sum e_i^2$, that is minimized by the least squares line, but we are more interested in their variance, obtained as

$$s_{\text{res}}^2 = \frac{\sum e_i^2}{n - 2}$$

or, for calculation,

$$s_{\text{res}}^2 = \frac{1}{n - 2}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right),$$

where $S_{xx}$, $S_{yy}$ and $S_{xy}$ are the sums of squares and the sum of products about the means introduced in section 11.13.1. The square root of this expression, the residual standard deviation, $s_{\text{res}}$, is used in subsequent calculations.
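As a rough illustration of the residual calculation, the sketch below computes the residual variance directly from the residuals and then takes its square root. The function name and data are hypothetical, not taken from Table 11.6.

```python
import math


def residual_sd(xs, ys):
    """Residual standard deviation of the least squares line, on n - 2 degrees of freedom."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Residual = observed value minus fitted value
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
s_res = residual_sd(xs, ys)
```

The divisor is $n - 2$ rather than $n - 1$ because two parameters, the slope and the intercept, have been estimated from the data.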

We can calculate the residual variance in the example as

so that the residual standard deviation is

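11.13.3 Confidence intervals

(a) Slope

The standard error of the slope, $b$, is strongly related to the residual standard deviation, being

$$SE(b) = \frac{s_{\text{res}}}{\sqrt{S_{xx}}}$$

where $s_{\text{res}}$ is the residual standard deviation and $S_{xx} = \sum (x_i - \bar{x})^2$, so that a $100(1 - \alpha)\%$ confidence interval for $b$ is

$$b - t_{1-\alpha/2}\,SE(b) \quad \text{to} \quad b + t_{1-\alpha/2}\,SE(b)$$

where $t_{1-\alpha/2}$ is the appropriate value of the $t$ distribution on $n - 2$ degrees of freedom (the degrees of freedom associated with the residual).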
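A minimal sketch of the confidence interval for the slope is given below. The function name and data are hypothetical. The $t$ value is passed in by the caller, as it would be read from a table such as Table B4; the value 4.303 used here is the two-sided 5% point of the $t$ distribution on 2 degrees of freedom, appropriate only for this four-point toy data set.

```python
import math


def slope_confidence_interval(xs, ys, t_value):
    """Return (b, se_b, lower, upper) for the slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    s_res = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    se_b = s_res / math.sqrt(s_xx)   # standard error of the slope
    return b, se_b, b - t_value * se_b, b + t_value * se_b


# Hypothetical data: n = 4, so the residual has 2 degrees of freedom
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
b, se_b, lower, upper = slope_confidence_interval(xs, ys, t_value=4.303)
```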

The slope is usually the aspect of most interest. The standard error of $b$ is

From Table B4 the value of $t_{0.975}$ on 21 degrees of freedom is 2.08, so a 95% confidence interval is given by $b \pm 2.08 \times SE(b)$, that is, 0.00012 to 0.044. The confidence interval thus extends from zero, representing no relation between the variables, to twice the value observed in the sample.

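(b) Estimated $Y$ for a given $X$

The standard error of the estimate $y_{\text{fit}}$ for a given value of $X$, say $x_0$, is given by

$$SE(y_{\text{fit}}) = s_{\text{res}} \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

and a $100(1 - \alpha)\%$ confidence interval is given by

$$y_{\text{fit}} - t_{1-\alpha/2}\,SE(y_{\text{fit}}) \quad \text{to} \quad y_{\text{fit}} + t_{1-\alpha/2}\,SE(y_{\text{fit}})$$

where $s_{\text{res}}$ is the residual standard deviation, $S_{xx} = \sum (x_i - \bar{x})^2$, and $t_{1-\alpha/2}$ is on $n - 2$ degrees of freedom.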
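The following sketch, using hypothetical names and data, evaluates the standard error of the fitted mean at two values of $x_0$ to show how the uncertainty grows away from the mean of $x$. The $t$ value is again supplied by the caller.

```python
import math


def mean_confidence_interval(xs, ys, x0, t_value):
    """Return (y_fit, se_fit, lower, upper) for the mean of Y at X = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    s_res = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_fit = a + b * x0
    # Standard error of the fitted mean; smallest when x0 equals x_bar
    se_fit = s_res * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_fit, se_fit, y_fit - t_value * se_fit, y_fit + t_value * se_fit


# Hypothetical data; 4.303 is the t value for 2 degrees of freedom
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
at_mean = mean_confidence_interval(xs, ys, x0=2.5, t_value=4.303)
at_edge = mean_confidence_interval(xs, ys, x0=4.0, t_value=4.303)
```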

We can obtain a 95% confidence interval for the predicted mean value of Vcf for any blood glucose. If $y_{\text{fit}}$ is the predicted mean Vcf from the regression equation, then the standard error of $y_{\text{fit}}$ is

where is the blood glucose value. So for a blood glucose of the estimated mean Vcf is given by the regression equation as

The standard error of this estimate is thus

We use the equation above to get a confidence interval for the estimate of 1.419. From Table B4 the value of $t_{0.975}$ with 21 degrees of freedom is 2.080, so that the 95% confidence interval is given by

or 1.29 to /sec.

(c) Intercept

The intercept is not usually of great interest, but a confidence interval can be obtained for the intercept $a$ using the formula in the previous section to get a confidence interval for $y_{\text{fit}}$ when $x_0 = 0$.

11.13.4 Prediction interval

The prediction interval is much wider than the confidence interval for the line, as the scatter of the individual data about the fitted line becomes more directly relevant.

For any value $x_0$ the predicted value is $y_{\text{fit}} = a + b x_0$. To get the prediction interval we do not want the standard error of $y_{\text{fit}}$, but the estimated standard deviation of individual values of $Y$ at that value of $X$. This standard deviation is given by

$$s_{\text{pred}} = s_{\text{res}} \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$

and thus the $100(1 - \alpha)\%$ prediction interval is

$$y_{\text{fit}} - t_{1-\alpha/2}\,s_{\text{pred}} \quad \text{to} \quad y_{\text{fit}} + t_{1-\alpha/2}\,s_{\text{pred}}$$

where $s_{\text{res}}$ is the residual standard deviation, $S_{xx} = \sum (x_i - \bar{x})^2$, and $t_{1-\alpha/2}$ is on $n - 2$ degrees of freedom.
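The extra '1' under the square root is what makes the prediction interval wider than the confidence interval for the mean. The sketch below, with hypothetical names and data, returns both standard deviations so that they can be compared.

```python
import math


def prediction_interval(xs, ys, x0, t_value):
    """Return (y_fit, se_fit, s_pred, lower, upper) for an individual Y at X = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    s_res = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_fit = a + b * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_fit = s_res * math.sqrt(leverage)       # uncertainty in the fitted mean
    s_pred = s_res * math.sqrt(1 + leverage)   # adds the scatter of individuals
    return y_fit, se_fit, s_pred, y_fit - t_value * s_pred, y_fit + t_value * s_pred


# Hypothetical data; 4.303 is the t value for 2 degrees of freedom
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
y_fit, se_fit, s_pred, lower, upper = prediction_interval(xs, ys, x0=2.5, t_value=4.303)
```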

The estimated standard deviation of Vcf values for individuals with a blood glucose of is

The prediction interval is therefore

or to , which is considerably wider than the confidence interval for the mean, as can be seen by comparing Figures 11.14 and 11.15.

11.13.5 Hypothesis test for $b$

We have seen that the standard error of the estimated slope, $b$, is $SE(b) = s_{\text{res}} / \sqrt{S_{xx}}$, so we can perform a test of the hypothesis that the population slope is zero by calculating $t = b / SE(b)$. This ratio is compared with the $t$ distribution with $n - 2$ degrees of freedom.

We can thus test the null hypothesis of no relation between Vcf and blood glucose. We simply divide the estimate of $b$ by its standard error and compare the result with the appropriate value of the $t$ distribution. The resulting ratio is compared with the $t$ distribution with 21 degrees of freedom, the value of $t_{0.975}$ being 2.08. The slope is thus just significantly different from zero at the 5% level.

11.13.6 Analysis of variance table

The results of a regression analysis can be displayed in an analysis of variance table, by partitioning the total variability in the dependent variable into a component explained by the regression line and unexplained or residual variation.

The total sum of squares of the dependent variable is $S_{yy} = \sum (y_i - \bar{y})^2$ (with $n - 1$ degrees of freedom) and the sum of squares due to the regression is $S_{xy}^2 / S_{xx}$, with 1 degree of freedom, where $S_{xx} = \sum (x_i - \bar{x})^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$. The residual sum of squares (with $n - 2$ degrees of freedom) can be obtained by subtraction.
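The partition just described can be sketched as below. The function name, the dictionary layout and the data are hypothetical choices made for illustration.

```python
def regression_anova(xs, ys):
    """Partition the total sum of squares of y into regression and residual parts."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    regression_ss = s_xy ** 2 / s_xx
    residual_ss = s_yy - regression_ss   # obtained by subtraction
    # Each entry is (sum of squares, degrees of freedom)
    return {
        "regression": (regression_ss, 1),
        "residual": (residual_ss, n - 2),
        "total": (s_yy, n - 1),
    }


# Hypothetical data, for illustration only
table = regression_anova([1, 2, 3, 4], [2, 4, 5, 8])
f_ratio = (table["regression"][0] / 1) / (table["residual"][0] / table["residual"][1])
```

The ratio of the two mean squares, `f_ratio`, is the square of the $t$ statistic for the slope.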

For the blood glucose data the total sum of squares is 1.1934 and the sum of squares due to regression is . These results are shown in Table 11.7.

11.14 INTERPRETATION OF REGRESSION

As discussed in Chapter 8, the variability among a set of observations may be partly attributed to known factors and partly to unknown sources; the latter is often termed 'random variation'. In linear regression we see how much of the variability in the response variable can be attributed to different values of the predictor variable, and the scatter either side of the fitted line shows unexplained variability. Because of this variability, the fitted line is only an estimate of the relation between these variables in the population. As with other estimates (such as a sample mean) there will be uncertainty associated with the estimated slope and intercept, $b$ and $a$. The confidence interval for the slope will indicate the uncertainty in the estimated strength of the relationship, and confidence intervals for the whole line and prediction intervals for individual subjects show other aspects of variability. The latter are especially useful as regression is often used to make predictions about individuals.

It should be remembered that the regression line should not be used to make predictions for $X$ values outside the range of values in the observed data. Such extrapolation is unjustified as we have no evidence about the relationship beyond the observed data. A statistical model is only an approximation. One rarely believes, for example, that the true relationship is exactly linear, but the linear regression equation is taken as a reasonable approximation for the observed data. Outside the range of the observed data one cannot safely use the same equation. Thus we should not use the regression line shown in Figure 11.14 to predict Vcf for blood glucose values outside the observed range, which begins at about 4.

An example of the danger of extrapolation is seen from a quadratic regression model fitted to the world record times to run a mile from 1954 to 1984. Kitson (1984) produced the model

where 'Year' is the calendar year - 1900. He observed that the 'ultimate mile' will be run in 1998 in a time of 3 min 46.66 sec, and that on the basis of this model 'we may already be within one second of the ultimate mile'. He failed to observe, however, that after 1998 his model indicates that the world record time will start to increase again (see Figure 11.17), which is clearly impossible!
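Why any fitted quadratic must eventually turn round can be seen from one line of calculus. Writing the model generically as time $= a + b\,\text{Year} + c\,\text{Year}^2$, where $a$, $b$ and $c$ are placeholder symbols and not Kitson's fitted coefficients, the curve is flat where its derivative is zero:

$$\frac{d(\text{time})}{d(\text{Year})} = b + 2c\,\text{Year} = 0 \quad \Longrightarrow \quad \text{Year}^{*} = -\frac{b}{2c}.$$

If $c > 0$ this turning point is a minimum, so the predicted record time falls until $\text{Year}^{*}$ and rises afterwards. On this reading, the 'ultimate mile' of 1998 is simply the minimum of the fitted curve, at $\text{Year}^{*} = 98$ on the model's scale, and says nothing about what happens beyond the observed data.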

Figure 11.17 Quadratic curve fitted to world record times to run a mile (Kitson, 1984), showing the range of observations (1954 to 1984).

Nor should the regression line be used to predict the $X$ variable from the $Y$ variable. If we wish to predict blood glucose level from Vcf (which is probably not very sensible) we ought first to calculate the regression of blood glucose on Vcf. Regression is not a symmetric relation between two variables, so we need the appropriate regression line for our purpose.

Few of the cautions that were made about the interpretation of the correlation coefficient apply to regression analysis. One that does is that relating to the analysis of data from different groups as if they were a single sample. The slope of the regression line may be considerably affected if we pool data from two groups where there is a marked difference in the distribution of the values of either or both variables. An example would be the regression of blood pressure on age for males and females. Such data should either be analysed separately for the different groups or analysis of covariance should be used. Another restriction that is relevant is that the observations should be independent; in practice this means that there should be only one observation per individual.

Regression analysis is valid when values of the predictor variable have been selected by the experimenter, as is common in laboratory experiments. Also, as I have already noted, there is no requirement for the $X$ variable to have a Normal distribution. However, if there is a value of $X$ that is distant from the main body of the data, that observation may exert an undue influence on the position of the regression line, especially if the value of the $Y$ variable is also extreme. An example was given in Figure 7.2.

If the distribution of the data or of the residuals leads to concern about the wisdom of using the regression methods described, there is a non-parametric form of regression (see Sprent, 1989). Non-parametric regression is very rarely performed, in contrast to non-parametric correlation.

11.15 RELATION TO OTHER ANALYSES

(This section can be omitted without loss of continuity.)

Two analyses discussed in earlier chapters illustrate specialized uses of regression, although they were not presented as regression analyses. These are the test for linear trend in a one way analysis of variance (section 9.8.5) and the Chi squared test for trend among proportions (section 10.8.2). They are reconsidered briefly below.

11.15.1 Trend in one way analysis of variance

The test for trend across three or more groups in a one way analysis of variance was described in sections 9.8.5 and 9.9.2. Scores were given to the groups and the between group sum of squares was partitioned into linear and non-linear components. The test for linear trend is almost equivalent to a regression of the outcome variable on the scores. It is not exactly the same because the analysis of variance also uses one degree of freedom to test for non-linear variation among the groups, but in essence the method is a linear regression analysis. The slope of the regression line corresponds to the change in the response variable per unit change in score. This is more useful than a trend statistic whose value depends upon the values of the scores.

When there are only two groups with scores $-1$ and 1, regression on the group scores is exactly equivalent to the two-sample $t$ test. The slope of the regression line is half the difference between the group means.
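The statement about the slope can be checked numerically. In this sketch the two groups and the helper name `slope` are hypothetical; the scores are taken to be $-1$ and $+1$, which is the coding under which a slope of half the difference in means arises.

```python
def slope(xs, ys):
    """Least squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx


# Hypothetical observations in two groups
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 6.0, 8.0]

# Score -1 for every member of group A and +1 for every member of group B
scores = [-1] * len(group_a) + [1] * len(group_b)
b = slope(scores, group_a + group_b)

mean_difference = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
```

Moving from one group to the other changes the score by 2, which is why the slope is half of `mean_difference`.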

11.15.2 Trend in a frequency table

The Chi squared test for trend is used to assess a trend in proportions in a $2 \times k$ frequency table (see section 10.8.2). The method is exactly equivalent to regressing the row variable, coded 0 and 1 say, on the column scores. For the Caesarean section data in Table 10.19, if we give scores 1 to 6 to the six shoe size groups, regression of the row variable (coded 0 and 1) on these scores for the 351 observations gives a negative slope (SE 0.0106). The $P$ value is the same as for the Chi squared test for trend shown in section 10.8.2. However, the regression approach is more informative as it yields an estimate of the change in proportion from one group to the next. Here the estimate is a reduction of about 3% in the proportion for each step from one shoe size group to the next. We can use the standard error to obtain a confidence interval in the usual way.

11.16 PRESENTATION OF REGRESSION

The equation of the regression line should be given, together with the residual standard deviation. Wherever possible the regression line should be shown in a plot together with a scatter diagram of the raw data. The line should not extend beyond the range of the observed values of the predictor variable $X$. A plot of the regression line alone gives no more information than the equation of the line.

The standard error of the slope is useful, as is the $P$ value for the $t$ test. A confidence interval for the line or, more usefully, prediction intervals for new observations are especially informative and can be shown in the same plot.

The accuracy used for the coefficients should be related to the accuracy of the raw data. It makes no sense, for example, to give an equation that purports to predict birth weight far more precisely than it can be measured, which is what is implied by the following quadratic regression equation of birth weight on fetal abdominal area:

(Campogrande et al., 1977). It is common for the estimate of $a$ to be larger than that of $b$, but $a$ and $b$ are frequently reported to the same number of decimal places. However, it is the slope, $b$, that is needed with more precision, not less, when making predictions, so it should be given at least as precisely as $a$, if not more so. Precision here refers to the number of 'significant digits' (i.e. ignoring zeros at the beginning). Thus, in the equation given earlier relating Vcf to blood glucose, the intercept and slope are both given to three significant digits. Contrast this with the quadratic equation given above.

Most computer programs for regression analysis give the information necessary to perform all the calculations described in this chapter. Not many will actually calculate and plot confidence intervals and prediction intervals, but they should give the residual standard deviation (perhaps under a different name) to allow these intervals to be calculated.

In output from computer programs the quantity $s_{\text{res}} / \sqrt{n}$ is sometimes called the 'standard error of the estimate' (SEE). This is not a good name as it wrongly implies that it is the standard error of any value $y_{\text{fit}}$ estimated from the regression line. In fact $s_{\text{res}} / \sqrt{n}$ is the standard error of $y_{\text{fit}}$ only at the mean value of $X$, i.e. when $x_0 = \bar{x}$ (see section 11.13.3). As we have seen, uncertainty increases as we move away from the mean. This mistake is sometimes seen in published papers where confidence limits are shown parallel to the regression line. Worse, some programs call the residual standard deviation $s_{\text{res}}$ the 'standard error of the estimate', which is highly misleading.
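The point about the mean value of $X$ can be checked directly, assuming the usual formula for the standard error of a fitted value at $x_0$, with $s_{\text{res}}$ the residual standard deviation and $S_{xx} = \sum (x_i - \bar{x})^2$. Setting $x_0 = \bar{x}$ removes the second term under the square root:

$$SE(y_{\text{fit}}) = s_{\text{res}} \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \quad \xrightarrow{\; x_0 = \bar{x} \;} \quad s_{\text{res}} \sqrt{\frac{1}{n}} = \frac{s_{\text{res}}}{\sqrt{n}}.$$

For any other $x_0$ the term $(x_0 - \bar{x})^2 / S_{xx}$ is positive, so the standard error is larger and correctly drawn confidence limits curve away from the line rather than running parallel to it.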

11.17 REGRESSION OR CORRELATION?

Regression and correlation have been presented separately in this chapter to clarify the difference between their purposes. Mathematically, however, the two methods are very closely related, as can be seen from the formulae in sections 11.7 and 11.13. In fact the $t$ test of the null hypothesis of zero correlation is exactly equivalent to that for the hypothesis of zero slope in regression analysis - the $P$ values are identical. Many computer programs automatically provide the correlation coefficient when performing a regression analysis, but it helps to remember that regression and correlation are distinct methods which serve different purposes. It is not usually sensible to perform both unless one is genuinely interested in both analyses, which is probably not very common. For example, we would not wish to predict the consumption of pork in Albania if we happened to know the mortality from cirrhosis among Albanians. In contrast, we are not interested in the correlation between Vcf and blood glucose level once we have carried out the much more informative regression analysis.
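The equivalence of the two tests can be illustrated numerically. The sketch below uses hypothetical data and function names, and assumes the usual test statistic for a correlation coefficient, $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$; it computes that statistic and the ratio of the slope to its standard error, and the two agree up to rounding error.

```python
import math


def _sums(xs, ys):
    """Return n and the sums of squares and products about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return n, s_xx, s_yy, s_xy


def t_from_slope(xs, ys):
    """t statistic for zero slope: b divided by its standard error."""
    n, s_xx, s_yy, s_xy = _sums(xs, ys)
    b = s_xy / s_xx
    s_res = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
    return b / (s_res / math.sqrt(s_xx))


def t_from_correlation(xs, ys):
    """t statistic for zero correlation, computed from r alone."""
    n, s_xx, s_yy, s_xy = _sums(xs, ys)
    r = s_xy / math.sqrt(s_xx * s_yy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
t_slope = t_from_slope(xs, ys)
t_corr = t_from_correlation(xs, ys)
```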

Correlation is a much over-used technique, with a significant correlation coefficient often wrongly interpreted as important and, even worse, as necessarily indicating a causal relationship. Its use should be mainly for generating hypotheses rather than for testing them. Correlation reduces a set of data to a single number that bears no direct relation to the actual data. Regression is a much more useful method, with results which are clearly related to the measurements obtained. The strength of the relation is explicit, and uncertainty can be seen clearly from confidence intervals or prediction intervals.

Give a man three weapons - correlation, regression, and a pen - and he will use all three.

(Anon, 1978)

EXERCISES

11.1 Lactic acidosis, a disorder of acid-base metabolism, is usually rapidly fatal. Dichloroacetate was administered intravenously (50 mg/kg body weight) to 29 paediatric and adult patients (Stacpoole et al., 1988). The table below shows the recorded changes in some metabolic and haemodynamic variables, together with the patients' survival times (in hours).

| Patient | Change in arterial lactate | Change in arterial bicarbonate | Change in arterial pH | Survival time (hours) |
|---|---|---|---|---|
| 1 | 4.1 | -1.2 | -0.05 | 4 |
| 2 | -4.4 | 2.0 | 0.03 | 4 |
| 3 | 0.1 | 2.9 | 0.02 | 14 |
| 4 | 4.4 | -2.5 | 0.07 | 15 |
| 5 | 8.7 | -4.0 | -0.12 | 16 |
| 6 | -30.7 | 4.4 | 0.17 | 24 |
| 7 | 1.7 | -0.9 | 0.01 | 29 |
| 8 | -1.5 | 4.5 | 0.15 | 31 |
| 9 | 7.4 | 1.8 | -0.13 | 32 |
| 10 | 9.9 | -12.9 | -0.28 | 36 |
| 11 | 13.1 | -11.9 | -0.33 | 36 |
| 12 | 3.1 | -6.3 | -0.22 | 36 |
| 13 | 15.2 | -2.0 | -0.16 | 41 |
| 14 | 2.5 | 1.0 | 0.01 | 46 |
| 15 | 7.9 | 2.5 | -0.22 | 48 |
| 16 | 4.2 | -2.2 | -0.03 | 48 |
| 17 | 2.8 | -4.0 | -0.04 | 60 |
| 18 | 14.3 | -2.4 | -0.01 | 60 |
| 19 | 16.2 | -12.8 | -0.15 | 72 |
| 20 | 17.5 | -4.4 | -0.09 | 96 |
| 21 | 2.7 | -7.1 | -0.21 | 192 |
| 22 | 4.4 | -4.7 | -0.05 | 336 |
| 23 | 4.8 | -9.8 | -0.05 | 456 |
| 24 | 9.0 | -7.5 | 0.09 | 672 |
| 25 | 14.7 | -7.2 | -0.23 | 768 |
| 26 | 6.2 | -4.2 | -0.13 | 1080 |
| 27 | 18.4 | -12.3 | -0.12 | 2160 |
| 28 | 16.9 | -8.6 | -0.17 | 2160 |
| 29 | 26.0 | -21.3 | -0.32 | 24456* |

*: still alive

(a) The authors used Spearman's rank correlation to look for associations with survival time. Is this a valid analysis, bearing in mind that one of the survival times is censored?

(b) Would the use of Pearson's correlation coefficient be valid?

(c) Which variable has the strongest correlation with survival time?

11.2 The following table shows resting metabolic rate (RMR) (kcal/24 hr) and body weight (kg) of 44 women (Owen et al., 1986).

| Subject | Body weight | RMR | Subject | Body weight | RMR |
|---|---|---|---|---|---|
| 1 | 49.9 | 1079 | 17 | | |
| 2 | 50.8 | 1146 | 18 | | |
| 3 | 51.8 | 1115 | 19 | | |
| 4 | 52.6 | 1161 | 20 | | |
| 5 | 57.6 | 1325 | 21 | | |
| 6 | 61.4 | 1351 | 22 | | |
| 7 | 62.3 | 1402 | 23 | | |
| 8 | 64.9 | 1365 | 24 | | |
| 9 | 43.1 | 870 | 25 | | |
| 10 | 48.1 | 1372 | 26 | | |
| 11 | 52.2 | 1132 | 27 | | |
| 12 | 53.5 | 1172 | 28 | | |
| 13 | 55.0 | 1034 | 29 | | |
| 14 | 55.0 | 1155 | 30 | | |
| 15 | 56.0 | 1392 | 31 | | |
| 16 | 57.8 | 1090 | 32 | | |
| 33 | 88.6 | 1323 | 39 | 107.7 | |
| 34 | 89.3 | 1300 | 40 | 110.2 | |
| 35 | 91.6 | 1519 | 41 | 122.0 | |
| 36 | 99.8 | 1639 | 42 | 123.1 | |
| 37 | 103.0 | 1382 | 43 | 125.2 | |
| 38 | 104.5 | 1414 | 44 | 143.3 | |

(a) Perform linear regression analysis of RMR on body weight.

(b) Examine the distribution of residuals. Is the analysis valid?

(c) Obtain a confidence interval for the slope of the line.

(d) Is it possible to use an individual's weight to predict their RMR to within 250 kcal/24 hr?

11.3 In the worked example of regression analysis (section 11.10.1) the test for non- Normality of the residuals gave

(a) Using the data in Table 11.6, carry out a regression of on blood glucose.

(b) Are the residuals from this analysis more nearly Normal?

(c) Compare the predicted Vcf and prediction intervals derived from the two models for a diabetic patient with a fasting blood glucose of

11.4 What is odd about the data in Table 11.2?

11.5 Digoxin is a drug that is largely eliminated unchanged in the urine. Its renal clearance was said to be (a) correlated with creatinine clearance and (b) independent of urine flow. The following table shows measurements of these three variables from 35 consecutive inpatients being treated with digoxin for congestive heart failure (Halkin et al., 1975).

| Patient | Creatinine clearance (ml/min/1.73 m²) | Digoxin clearance (ml/min/1.73 m²) | Urine flow (ml/min) |
|---|---|---|---|
| 1 | 19.5 | 17.5 | 0.74 |
| 2 | 24.7 | 34.8 | 0.43 |
| 3 | 26.5 | 11.4 | 0.11 |
| 4 | 31.1 | 29.3 | 1.48 |
| 5 | 31.3 | 13.9 | 0.97 |
| 6 | 31.8 | 31.6 | 1.12 |
| 7 | 34.1 | 20.7 | 1.77 |
| 8 | 36.6 | 34.1 | 0.70 |
| 9 | 42.4 | 25.0 | 0.93 |
| 10 | 42.8 | 47.4 | 2.50 |
| 11 | 44.2 | 31.8 | 0.89 |
| 12 | 49.7 | 36.1 | 0.52 |
| 13 | 51.3 | 22.7 | 0.33 |
| 14 | 55.0 | 30.7 | 0.80 |
| 15 | 55.9 | 42.5 | 1.02 |
| 16 | 61.2 | 42.4 | 0.56 |
| 17 | 63.1 | 61.1 | 0.93 |
| 18 | 63.7 | 38.2 | 0.44 |
| 19 | 66.8 | 37.5 | 0.50 |
| 20 | 72.4 | 50.1 | 0.97 |
| 21 | 80.9 | 50.2 | 1.02 |
| 22 | 82.0 | 50.0 | 0.95 |
| 23 | 82.7 | 31.8 | 0.76 |
| 24 | 87.9 | 55.4 | 1.06 |
| 25 | 101.5 | 110.6 | 1.38 |
| 26 | 105.0 | 114.4 | 1.85 |
| 27 | 110.5 | 69.3 | 2.25 |
| 28 | 114.2 | 84.8 | 1.76 |
| 29 | 117.8 | 63.9 | 1.60 |
| 30 | 122.6 | 76.1 | 0.88 |
| 31 | 127.9 | 112.8 | 1.70 |
| 32 | 135.6 | 82.2 | 0.98 |
| 33 | 136.0 | 46.8 | 0.94 |
| 34 | 153.5 | 137.7 | 1.76 |
| 35 | 201.1 | 76.1 | 0.87 |

Do these data support statements (a) and (b) above?

12 Relation between several variables

Exploration of the data set is admirable, but the investigator should know that he is exploring and searching, not reviewing a confirmatory experiment.

Lachenbruch (1977)

12.1 INTRODUCTION

Chapters 9, 10 and 11 cover the basic statistical methods used to analyse the large majority of medical data sets. Few research reports do not make use of some of those techniques, and most will not go further. Most studies, however, obtain data on many variables, which are either analysed by a series of simple analyses or by rather more complicated statistical methods. In general it is preferable to use the more advanced methods where these are appropriate, rather than looking separately at several small parts of the data set.

This chapter builds on the methods of Chapters 9 to 11, by extending the ideas in those chapters to more complex data sets. Chapter 13 continues the process, but is devoted to the analysis of survival data, which poses several special problems even in simple comparisons.

12.2 ANALYSIS OF VARIANCE AND MULTIPLE REGRESSION

Chapter 9 introduced a variety of methods for comparing two or more groups with respect to a single continuous variable. In section 12.3 I shall show how these methods can be extended to consider data sets with two or more classifying variables, methods given the general name analysis of variance whether parametric or non-parametric. If there are two classifying variables the analysis is known as two way analysis of variance, and so on. These methods require the same number of observations in each 'cell' of the cross-classification, a condition often, but not always, met in experimental studies but rarely, if ever, true for observational studies. For example, if we wish to compare birth weights of boys and girls with different lengths of gestation we cannot control the numbers of babies in each age-sex group, so we cannot use analysis of variance.

The way round this problem is, perhaps surprisingly, related to the technique of linear regression described in Chapter 11. I showed there how to describe the relation between two variables, or, more specifically, how the value of one variable can be predicted from the value of the other. This method too can be extended, to allow us to predict the value of a variable from the values of several other variables. In other words, we have a single dependent (outcome) variable and two or more explanatory (predictor) variables. The method is called multiple regression. The explanatory variables can be either continuous or binary (0- 1) or categorical. Multiple regression can thus be used to regress birth weight on sex and gestational age. It can be shown that all analysis of variance problems can also be analysed in the framework of multiple regression (see section 12.4), but for balanced data sets (usually from experiments) it is more common to keep to the analysis of variance approach.

The above discussion relates to the case where the outcome variable is continuous. In section 12.5 I shall show how a similar approach can be taken for a binary outcome variable, using multiple logistic regression, and in Chapter 13 the same general ideas will be used for the analysis of survival data.

12.3 TWO WAY ANALYSIS OF VARIANCE

In Chapter 9 I considered several problems involving the same measurement taken on independent groups of individuals. Often more than one measurement is taken from each person, perhaps under different experimental conditions, and we require a method that may be seen as a generalization of the paired $t$ test. Data of this type can be dealt with by the method known as two way analysis of variance, which is used to analyse data which can be displayed within a cross-classification of two categorical variables, called 'factors'.

The general structure of such data sets is shown in Table 12.1, where each $x$ indicates an observation. In this structure, we may have one or more observations for each combination of levels of the two factors A and B. I shall only consider the case where the number of observations in each cell is the same. I shall assume, therefore, that there are no missing observations.

本节讨论两类符合此框架的研究。第一类是同一变量在不同条件下对同一组个体进行两次或多次观测,例如每位患者接受多种治疗。此时图中因素B代表不同的受试者。每个治疗下每位受试者可能有多个观测值。
This section deals with two types of study that fall into this framework. The first is where two or more observations of the same variable are taken from the same individuals under different circumstances, for example where each patient receives more than one treatment. Here factor B in the diagram represents different subjects. There may be more than one observation per subject on each treatment.

表12.1 双因素交叉分类的一般结构。每个 x 表示单个观测值,x…x 表示一系列观测值
Table 12.1 General structure of a two way cross-classification. Each x represents a single observation, and x…x represents a series of observations

| Factor B | Factor A: 1 | 2 | 3 | … | c |
|---|---|---|---|---|---|
| 1 | x…x | x…x | x…x | … | x…x |
| 2 | x…x | x…x | x…x | … | x…x |
| 3 | x…x | x…x | x…x | … | x…x |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| r | x…x | x…x | x…x | … | x…x |

第二类情况是两个因素共同决定测量性质,每种组合施加于一个或多个患者。例如,我们可能对男性和女性分别在两种或多种不同治疗后的血压进行观测。此时因素A和B分别代表治疗和性别,每种组合下有多个受试者。我将详细讨论各自的一个例子,然后再讨论其他设计。
The second case is where there are two factors specifying the nature of the measurements, and each combination is given to one or more patients. For example we may have observations on blood pressure after two or more different treatments for males and females separately. Here factors A and B represent treatment and sex, and there are several different subjects for each combination. I shall consider one example of each in detail, and then discuss other designs.

12.3.1 重复观测 12.3.1 Repeated observation

表12.2显示了九位充血性心力衰竭患者的心率。测量是在给予恩那普利(一种血管紧张素转换酶抑制剂)前及给药后30、60和120分钟进行的。该设计看似与第9.8节中单因素方差分析类似,但这里不同时间点的测量均在同一受试者身上完成。因此,该设计更恰当地看作是配对t检验的自然扩展。该设计的优势在于,观察组间比较基于受试者内差异。受试者间差异通常较大(见表12.2),但不会影响我们区分不同时间点观察值差异的能力。
Table 12.2 shows the heart rate of nine patients with congestive heart failure before and shortly after administration of enalaprilat, an angiotensin-converting enzyme inhibitor. Measurements were taken before and at 30, 60 and 120 minutes after drug administration. This design appears similar to that analysed by one way analysis of variance in section 9.8, but here the measurements at the different times are on the same subjects. Thus this design should more appropriately be seen as a natural extension of the paired t test. The strength of this design is that comparisons between the sets of observations are based on within subject differences. Variation between subjects, which is usually considerable (see Table 12.2), does not affect our ability to distinguish differences between the sets of observations, which here relate to four time points.

表12.2 恩那普利短期对心率的影响(每分钟心跳次数)(Maskin 等,1985年)
Table 12.2 Short-term effect of enalaprilat on heart rate (beats per minute) (Maskin et al., 1985)

| Subject | Time (mins): 0 | 30 | 60 | 120 | Mean | (SD) |
|---|---|---|---|---|---|---|
| 1 | 96 | 92 | 86 | 92 | 91.50 | (4.1) |
| 2 | 110 | 106 | 108 | 114 | 109.50 | (3.4) |
| 3 | 89 | 86 | 85 | 83 | 85.75 | (2.5) |
| 4 | 95 | 78 | 78 | 83 | 83.50 | (8.0) |
| 5 | 128 | 124 | 118 | 118 | 122.00 | (4.9) |
| 6 | 100 | 98 | 100 | 94 | 98.00 | (2.8) |
| 7 | 72 | 68 | 67 | 71 | 69.50 | (2.4) |
| 8 | 79 | 75 | 74 | 74 | 75.50 | (2.4) |
| 9 | 100 | 106 | 104 | 102 | 103.00 | (2.6) |
| Mean | 96.56 | 92.56 | 91.11 | 92.33 | 93.14 | |
| (SD) | (16.4) | (17.8) | (17.2) | (16.5) | (16.4) | |

在第9.8节,我展示了单因素方差分析中如何将总变异分解为组间和组内变异。双因素方差分析采用了类似的方法,但自然更为复杂。在本例中,对于表12.2中的心率数据,我们可以将总变异分解为时间间变异和受试者间变异,还有部分剩余变异,称为残差变异。该术语与第11章回归分析中的含义相同。
In section 9.8 I showed how in one way analysis of variance the total variability is separated into between group and within group components. A similar approach is adopted in two way analysis of variance, but naturally it is a bit more complicated. In the present example, for the heart rate data shown in Table 12.2, we can divide the total variability into components due to variation between times and between subjects, and there is some remaining variation which we refer to as residual variation. This term carries the same meaning as in regression analysis, described in Chapter 11.

表12.3显示了心率数据的方差分析表。用于检验受试者间和时间间方差(均方)的F值均通过除以残差方差获得。前者与自由度为8和24的F分布比较,后者与自由度为3和24的F分布比较。受试者间变异的P值极小,这在医学数据中常见。所有受试者心率相同的原假设被坚决拒绝,但这并无实际意义。本研究目的是考察恩那普利给药后两小时内心率的变化,通过表12.3中“时间间”行进行检验。P值为0.018,表明我们可以合理拒绝心率无变化的原假设。表12.2显示了各时间点的均值,表明心率在给药后30分钟平均下降了4次/分钟,并在接下来的90分钟内保持相对稳定。平均趋势从原始数据表中不易直接观察到。
Table 12.3 shows the analysis of variance table for the heart rate data. The F values for testing the between subjects and between times variances (mean squares) are each obtained by dividing by the residual variance. The former is compared with the F distribution with 8 and 24 degrees of freedom, and the latter with that with 3 and 24 degrees of freedom. The between subject variation has an extremely small P value, as is often the case with medical data. The null hypothesis that all subjects have the same heart rate is firmly rejected, but this is of no real interest. The purpose of this study was to investigate variation in heart rate over the two hours after administration of enalaprilat, which is examined by considering the 'between times' row of Table 12.3. The P value of 0.018 indicates that we can reasonably reject the null hypothesis that there is no change in heart rate over the two hours. Table 12.2 shows the means for each time point, indicating that heart rate fell by an average of four beats per minute (bpm) after 30 minutes, and remained fairly stable over the next 90 minutes. The average pattern is not obvious from examination of the raw data in the table.

表12.3 表12.2数据的方差分析
Table 12.3 Analysis of variance of data in Table 12.2

| Source of variation | df | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Subjects | 8 | 8966.556 | 1120.819 | 90.6 | < 0.0001 |
| Times | 3 | 150.972 | 50.324 | 4.07 | 0.018 |
| Residual | 24 | 296.778 | 12.366 | | |
| Total | 35 | 9414.306 | | | |
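The decomposition described above can be checked directly from the raw data. The following is a minimal sketch in plain Python, not the author's own calculation: the function name `two_way_anova` and the dictionary layout are illustrative choices, and P values are omitted because they need F distribution tables or a statistics library.

```python
# Heart rate (bpm) from Table 12.2.
# Rows are subjects 1-9; columns are 0, 30, 60 and 120 minutes.
heart_rate = [
    [96, 92, 86, 92],
    [110, 106, 108, 114],
    [89, 86, 85, 83],
    [95, 78, 78, 83],
    [128, 124, 118, 118],
    [100, 98, 100, 94],
    [72, 68, 67, 71],
    [79, 75, 74, 74],
    [100, 106, 104, 102],
]


def two_way_anova(data):
    """Two way analysis of variance with one observation per cell.

    Returns {source: (df, sum of squares, mean square, F)}; entries that
    do not apply to a source are None.
    """
    r, c = len(data), len(data[0])
    grand = sum(map(sum, data)) / (r * c)
    row_means = [sum(row) / c for row in data]
    col_means = [sum(row[j] for row in data) / r for j in range(c)]

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_resid = ss_total - ss_rows - ss_cols  # what is left over

    df_rows, df_cols = r - 1, c - 1
    df_resid = df_rows * df_cols
    ms_rows, ms_cols = ss_rows / df_rows, ss_cols / df_cols
    ms_resid = ss_resid / df_resid
    return {
        "subjects": (df_rows, ss_rows, ms_rows, ms_rows / ms_resid),
        "times": (df_cols, ss_cols, ms_cols, ms_cols / ms_resid),
        "residual": (df_resid, ss_resid, ms_resid, None),
        "total": (r * c - 1, ss_total, None, None),
    }


table = two_way_anova(heart_rate)
for source, (df, ss, ms, f) in table.items():
    ms_text = "" if ms is None else f"{ms:.3f}"
    f_text = "" if f is None else f"{f:.2f}"
    print(f"{source:<9}{df:>4}{ss:>12.3f}{ms_text:>12}{f_text:>9}")
```

The printed rows should agree with Table 12.3 to the precision shown there.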

关于时间趋势的具体假设可以使用与单因素方差分析相同的方法进行检验。例如,我们可以比较每一对时间点,采用 Bonferroni 校正以控制多重检验带来的误差,或者检验时间上的线性趋势。我们还可以构建任一时间点的均值或均值差的置信区间。对于所有这些分析,关键是要使用正确的方差,即在去除个体间变异后的残差方差。
Specific hypotheses relating to the time trend can be examined using the same approach as in one way analysis of variance. We could, for example, compare each pair of times, with a Bonferroni correction to allow for multiple testing, or look for a linear trend over time. We can also construct confidence intervals for the mean at any time or the difference between means. For all of these analyses it is essential that we use the correct variance, after the between subject variation has been removed, which is the residual variance.
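As a worked illustration of using the residual variance for a comparison between two time points, the sketch below computes a 95% confidence interval for the difference between the mean heart rates at 0 and 30 minutes. It assumes the residual mean square and means quoted in Tables 12.2 and 12.3, and takes the two-sided 5% point of the t distribution with 24 degrees of freedom (2.064) from standard tables; the choice of this particular pair of times is only an example.

```python
import math

resid_var = 12.366     # residual mean square from Table 12.3 (24 df)
n_subjects = 9         # each time point mean is based on nine subjects
t_crit = 2.064         # t distribution, 24 df, two-sided 5% point (from tables)

mean_0, mean_30 = 96.56, 92.56   # means at 0 and 30 minutes (Table 12.2)

# Standard error of the difference between two means, each based on
# n_subjects observations, using the residual variance.
se_diff = math.sqrt(resid_var * (1 / n_subjects + 1 / n_subjects))
diff = mean_0 - mean_30
t_stat = diff / se_diff
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)

print(f"difference = {diff:.2f} bpm, SE = {se_diff:.2f}, t = {t_stat:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f} bpm")
```

If all six pairs of times were compared in this way, a Bonferroni correction would multiply each P value by six.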

残差方差为 12.366,因此残差标准差为 $\sqrt{12.366} = 3.52$ bpm。通过拟合方差分析中隐含的模型,我们假设每个受试者的心率随时间的真实反应模式相同,或者等价地,个体间差异在每个时间点均相同。任何偏离该模型的情况均表示随机变异,例如测量误差。所有观测值的均值为 93.14 bpm,我们可以将每列和每行的均值表示为与总体均值的差异。每个单元格的预测值则是对应行均值与列均值之和减去总体均值,如表 12.4 所示。表 12.5 显示了观测数据与模型拟合值之间的差异,即残差。这些残差反映了模型拟合的不足,残差的方差即表 12.3 方差分析中显示的残差方差。如前所述,这些残差与等价回归分析中的残差完全一致。残差方差估计的是单个患者在同一时间点多次测量的方差(尽管这里只进行了一次测量)。
The residual variance is 12.366, so the residual standard deviation is $\sqrt{12.366} = 3.52$ bpm. By fitting the model implicit in the analysis of variance we have assumed that the true response pattern of heart rate over time is the same for each subject, and (equivalently) that the differences between subjects are the same at each time. Any departures from this model indicate random variation, for example that resulting from measurement error. The mean of all the observations was 93.14 bpm, and we can express the means for each column and row as differences from the overall mean. The value predicted in each cell is then obtained by adding the relevant row and column means, and subtracting the overall mean, as shown in Table 12.4. Table 12.5 shows the differences between the observed data and the values fitted by the model, called residuals. These show the lack of fit of the model, and the variance of the residuals is the residual variance shown in the analysis of variance in Table 12.3. As already noted, these residuals correspond exactly to residuals from the equivalent regression analysis. The residual variance is an estimate of the variance of multiple measurements on a single patient at the same time (even though only one such measurement was made).

表 12.4 基于双因素方差分析模型的预测心率
Table 12.4 Predicted heart rate based on the two way analysis of variance model

| Subject | Time (mins): 0 | 30 | 60 | 120 | Mean | Difference from overall mean |
|---|---|---|---|---|---|---|
| 1 | 94.92 | 90.92 | 89.47 | 90.69 | 91.50 | -1.64 |
| 2 | 112.92 | 108.92 | 107.47 | 108.69 | 109.50 | +16.36 |
| 3 | 89.17 | 85.17 | 83.72 | 84.94 | 85.75 | -7.39 |
| 4 | 86.92 | 82.92 | 81.47 | 82.69 | 83.50 | -9.64 |
| 5 | 125.42 | 121.42 | 119.97 | 121.19 | 122.00 | +28.86 |
| 6 | 101.42 | 97.42 | 95.97 | 97.19 | 98.00 | +4.86 |
| 7 | 72.92 | 68.92 | 67.47 | 68.69 | 69.50 | -23.64 |
| 8 | 78.92 | 74.92 | 73.47 | 74.69 | 75.50 | -17.64 |
| 9 | 106.42 | 102.42 | 100.97 | 102.19 | 103.00 | +9.86 |
| Mean | 96.56 | 92.56 | 91.11 | 92.33 | 93.14 | |
| Difference from overall mean | 3.42 | -0.58 | -2.03 | -0.81 | | |

表 12.5 方差分析的残差,计算方法为表 12.2 与表 12.4 中数值的差
Table 12.5 Residuals from the analysis of variance, calculated as the difference between the entries in Tables 12.2 and 12.4

| Subject | Time (mins): 0 | 30 | 60 | 120 | Mean |
|---|---|---|---|---|---|
| 1 | 1.08 | 1.08 | -3.47 | 1.31 | 0.00 |
| 2 | -2.92 | -2.92 | 0.53 | 5.31 | 0.00 |
| 3 | -0.17 | 0.83 | 1.28 | -1.94 | 0.00 |
| 4 | 8.08 | -4.92 | -3.47 | 0.31 | 0.00 |
| 5 | 2.58 | 2.58 | -1.97 | -3.19 | 0.00 |
| 6 | -1.42 | 0.58 | 4.03 | -3.19 | 0.00 |
| 7 | -0.92 | -0.92 | -0.47 | 2.31 | 0.00 |
| 8 | 0.08 | 0.08 | 0.53 | -0.69 | 0.00 |
| 9 | -6.42 | 3.58 | 3.03 | -0.19 | 0.00 |
| Mean | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
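The rule for predicted values (row mean plus column mean minus overall mean) and the resulting residuals can be sketched in a few lines of Python. This is only an illustration of the arithmetic behind Tables 12.4 and 12.5; the variable names are assumptions made for the example.

```python
import math

# Heart rate (bpm) from Table 12.2: rows are subjects, columns are time points.
heart_rate = [
    [96, 92, 86, 92],
    [110, 106, 108, 114],
    [89, 86, 85, 83],
    [95, 78, 78, 83],
    [128, 124, 118, 118],
    [100, 98, 100, 94],
    [72, 68, 67, 71],
    [79, 75, 74, 74],
    [100, 106, 104, 102],
]
r, c = len(heart_rate), len(heart_rate[0])

grand = sum(map(sum, heart_rate)) / (r * c)
row_means = [sum(row) / c for row in heart_rate]
col_means = [sum(row[j] for row in heart_rate) / r for j in range(c)]

# Predicted value in each cell: row mean + column mean - overall mean.
fitted = [[row_means[i] + col_means[j] - grand for j in range(c)]
          for i in range(r)]
# Residual: observed value minus predicted value.
residuals = [[heart_rate[i][j] - fitted[i][j] for j in range(c)]
             for i in range(r)]

# Residual variance uses (r - 1)(c - 1) degrees of freedom.
resid_var = sum(e ** 2 for row in residuals for e in row) / ((r - 1) * (c - 1))
resid_sd = math.sqrt(resid_var)

for i, row in enumerate(residuals, start=1):
    print(i, " ".join(f"{e:6.2f}" for e in row))
print(f"residual variance = {resid_var:.3f}, SD = {resid_sd:.2f} bpm")
```

The largest residuals in absolute size belong to subjects 4 and 9 at time 0, which is the feature of Table 12.5 picked up in the discussion of assumptions.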

12.3.2 假设 12.3.2 Assumptions

数据不要求总体或行列内服从正态分布。然而,残差应服从正态分布,这一假设可以通过正态概率图检验,如图 12.1 所示。心率残差的 检验结果为 ,因此我们可以放心模型在这方面是合理的。
There is no requirement for the data to be Normally distributed, either overall or within a row or column. The residuals, however, are expected to have a Normal distribution, an assumption that can be examined by a Normal plot as in Figure 12.1. The test for the heart rate residuals gives with , and so we can be happy that our model is reasonable in this respect.

即使残差分布接近正态,也不必然说明模型合适。观察表 12.5,受试者 4 和 9 存在较大残差,我们可能需要考虑不同个体的时间反应模式不一致的可能性。由于每个个体每个时间点只有一次观测,无法用这些数据检验此可能性。如果每个个体-时间组合有两次或更多观测,我们可以进行更全面的分析,具体来说,可以检验受试者和时间两个因素之间是否存在显著交互作用。下面将介绍这种更复杂的分析示例。如果方差分析的分布假设不成立,我们可以进行非参数分析,如第12.3.5节所述。
Even if the distribution of residuals is reasonably Normal it does not necessarily follow that the model is appropriate. Inspection of Table 12.5 shows some large values for subjects 4 and 9 and we might wish to consider the possibility that the response over time is not the same for all individuals. We cannot examine this possibility with these data, because there is only one observation per person at each time. If we had two or more observations for each person-time combination we would carry out a more comprehensive analysis. Specifically, we could examine the possible existence of a significant interaction between the two factors subject and time. An example of this more complex analysis is described below. If the distributional assumption of the analysis of variance is not met, we can perform a non-parametric analysis, as described in section 12.3.5.

图12.1 表12.2数据方差分析残差的正态概率图。
Figure 12.1 Normal plot of residuals from analysis of variance of the data in Table 12.2.

对用于说明双因素方差分析的心率数据的一个批评是,这些观测值来自同一实验中的一系列重复测量。这类数据严格来说不适合所描述的分析。一些软件可以执行“重复测量”方差分析,这对这类数据更为合适。另一种处理序列观测的方法见第14.6节。
A criticism of the heart rate data used to illustrate two way analysis of variance is that the observations relate to a sequence of repeated measurements in one experiment. Such data are not strictly appropriate for the analysis described. Some programs can perform a 'repeated measures' analysis of variance that is more correct for this type of data. Another way of looking at serial observations is described in section 14.6.

12.3.3 重复数据 12.3.3 Replicated data

方差分析也可用于研究测量变异性。表12.6展示了一项研究超声胎儿头围数据重现性的部分大量数据。四位观察者各自对同三个胎儿进行了三次测量。观察者对之前的测量结果一无所知,这与常规临床实践不同。该数据集与心率数据的结构性差异在于每个胎儿有三次重复测量。这使我们能够探讨观察者与胎儿之间是否存在交互作用;换言之,我们可以检验观察者间的差异是否因胎儿不同而超出偶然变异的预期。当我们研究一个或两个直接相关因素(如治疗和剂量)时,交互作用尤为重要。对于这组数据,我们并不特别关注这些特定的胎儿或观察者,而是希望估计测量的重现性。
Analysis of variance can also be used to study measurement variability. Table 12.6 shows part of a large set of data from a study investigating the reproducibility of ultrasonic fetal head circumference data. Four observers each took three measurements on the same three fetuses. The observers were kept unaware of their previous measurements, in contrast to usual clinical practice. The structural difference between this data set and the heart rate data is the availability of three replicate readings per fetus. These enable us to investigate the possibility of an interaction between observers and fetuses; in other words, we can see if the differences between observers vary from fetus to fetus more than we expect just from chance variation. Interaction is more important when we investigate one or two factors of direct interest, such as treatment and dose. With this data set we are not especially interested in these particular fetuses or observers, but wish to estimate the reproducibility of the measurements.

表12.6 四位观察者对胎儿头围(厘米)的测量
Table 12.6 Measurements of fetal head circumference (cm) by four observers

| | Observer 1 | Observer 2 | Observer 3 | Observer 4 |
|---|---|---|---|---|
| Fetus 1 | 14.3 | 13.6 | 13.9 | 13.8 |
| | 14.0 | 13.6 | 13.7 | 14.7 |
| | 14.8 | 13.8 | 13.8 | 13.9 |
| Fetus 2 | 19.7 | 19.8 | 19.5 | 19.8 |
| | 19.9 | 19.3 | 19.8 | 19.6 |
| | 19.8 | 19.8 | 19.5 | 19.8 |
| Fetus 3 | 13.0 | 12.4 | 12.8 | 13.0 |
| | 12.6 | 12.8 | 12.7 | 12.9 |
| | 12.9 | 12.5 | 12.5 | 13.8 |

表12.7显示了头围数据的方差分析表。检验每个效应的F值均通过均方除以残差均方获得。受试者与观察者之间的交互作用远未达到显著(P = 0.33)。若交互作用不显著,最好将其从模型中剔除,将其平方和与残差平方和合并,得到表12.8所示的简化分析。一般而言,若交互作用显著,主效应(此处为“胎儿”和“观察者”)则没有简单的解释,因为每个效应依赖于另一个因素的水平。
Table 12.7 shows the analysis of variance table for the head circumference data. Again the F values for testing each effect are obtained by dividing the mean squares by the residual mean square. The interaction between subjects and observers is not nearly significant (P = 0.33). If the interaction is not significant it is best to remove it from the model by pooling its sum of squares with the residual variation to give the simplified analysis shown in Table 12.8. In general, if the interaction is significant the main effects (here 'fetuses' and 'observers') do not have a simple interpretation because the effect of each depends upon the level of the other factor.

利用表12.8中的残差方差,我们可以计算残差标准差为 $\sqrt{0.080} = 0.28$ cm。因此,同一胎儿由同一观察者重复测量的估计标准差仅为 0.28 cm,表明测量误差较小。注意,这个分析中最有趣的部分是估计问题;假设检验并非真正关注的重点。
Using the residual variance from Table 12.8 we can calculate the residual standard deviation as $\sqrt{0.080} = 0.28$ cm. Thus replicated measurements of the same fetus by the same observer have an estimated standard deviation of only 0.28 cm, which shows that measurement error is small. Notice that this most interesting aspect of the analysis is an estimation problem; the hypothesis tests are not really of interest.

表12.7 头围数据(表12.6)双因素方差分析结果
Table 12.7 Results of two way analysis of variance of the head circumference data in Table 12.6

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Fetuses | 2 | 324.009 | 162.004 | 2113 | < 0.001 |
| Observers | 3 | 1.199 | 0.400 | 5.21 | 0.006 |
| Fetuses × Observers (Interaction) | 6 | 0.562 | 0.094 | 1.22 | 0.33 |
| Residual | 24 | 1.840 | 0.077 | | |
| Total | 35 | 327.610 | | | |

表12.8 省略交互作用后的头围数据方差分析
Table 12.8 Analysis of variance of the head circumference data omitting the interaction

| Source of variation | Degrees of freedom | Sums of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Fetuses | 2 | 324.009 | 162.004 | 2023 | < 0.001 |
| Observers | 3 | 1.199 | 0.400 | 4.99 | 0.006 |
| Residual | 30 | 2.402 | 0.080 | | |
| Total | 35 | 327.610 | | | |
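For replicated data the sums of squares split into two main effects, an interaction and a residual. The sketch below, in plain Python, works through that split for the head circumference data and then pools the interaction with the residual. It is an illustration rather than the author's own computation, and the variable names are assumptions; P values are left out because they require F distribution tables.

```python
import math

# Table 12.6: data[fetus][observer] holds the three replicate readings (cm).
data = [
    [[14.3, 14.0, 14.8], [13.6, 13.6, 13.8], [13.9, 13.7, 13.8], [13.8, 14.7, 13.9]],
    [[19.7, 19.9, 19.8], [19.8, 19.3, 19.8], [19.5, 19.8, 19.5], [19.8, 19.6, 19.8]],
    [[13.0, 12.6, 12.9], [12.4, 12.8, 12.5], [12.8, 12.7, 12.5], [13.0, 12.9, 13.8]],
]
a, b, n = len(data), len(data[0]), len(data[0][0])  # fetuses, observers, replicates

values = [x for fetus in data for cell in fetus for x in cell]
grand = sum(values) / len(values)
fetus_means = [sum(sum(cell) for cell in fetus) / (b * n) for fetus in data]
obs_means = [sum(sum(data[i][j]) for i in range(a)) / (a * n) for j in range(b)]
cell_means = [[sum(cell) / n for cell in fetus] for fetus in data]

ss_total = sum((x - grand) ** 2 for x in values)
ss_fetus = b * n * sum((m - grand) ** 2 for m in fetus_means)
ss_obs = a * n * sum((m - grand) ** 2 for m in obs_means)
ss_cells = n * sum((m - grand) ** 2 for row in cell_means for m in row)
ss_inter = ss_cells - ss_fetus - ss_obs   # fetus x observer interaction
ss_resid = ss_total - ss_cells            # variation between replicates

df_inter, df_resid = (a - 1) * (b - 1), a * b * (n - 1)
f_inter = (ss_inter / df_inter) / (ss_resid / df_resid)

# Pool the interaction with the residual, as in Table 12.8.
pooled_ms = (ss_inter + ss_resid) / (df_inter + df_resid)
pooled_sd = math.sqrt(pooled_ms)

print(f"fetuses {ss_fetus:.3f}  observers {ss_obs:.3f}  "
      f"interaction {ss_inter:.3f}  residual {ss_resid:.3f}")
print(f"F for interaction = {f_inter:.2f}")
print(f"pooled residual mean square = {pooled_ms:.3f}, SD = {pooled_sd:.2f} cm")
```

The pooled standard deviation is the estimate of measurement error discussed in the text.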

方差分析中 F 值的评估取决于分类变量本身是否有研究价值,还是仅代表更广泛的总体。这里的分析假设我们关注的是这些特定的胎儿和观察者,但这在本例中可能并不真实。然而,该分析与多元回归完全对应,且应用更为广泛。
The evaluation of F values in the analysis of variance differs according to whether the classifying variables are interesting in their own right or whether they are representative of a wider population. The analysis described assumes that we are interested in these particular fetuses and observers, which is probably untrue in this case. However, the analysis described corresponds exactly to multiple regression, and is more widely used.

12.3.4 扩展 12.3.4 Extensions

通过两个简单数据集介绍了多因素方差分析的一些思想。如前所述,这两个数据集都存在使得所用方法稍显不适用的特点。要求非常严格,医学研究数据很少能完全满足。第5.4节举了一个更复杂数据集的例子,描述了一项研究,探讨左右臂血压的可能差异。每个受试者测量16次,每个臂(左或右)、观察者和袖带组合测量两次。数据因此采用四因素方差分析。
Some of the ideas of multi-way analysis of variance have been introduced by means of two simple data sets. As noted, both have features that make them slightly inappropriate for the methods used. The requirements are very strict, and are not often met perfectly by medical research data. An example of a more complex data set was given in section 5.4, where I described a study to investigate the possible difference in blood pressure between the left and right arms. Each subject had 16 measurements made, two for each combination of arm (left or right), observer and cuff. Thus the data were analysed by a four way analysis of variance.

三因素及以上设计遵循相同原则,但可能出现本书未涉及的更多问题,尤其是变量未完全交叉分类时。例如,测量一组受试者在两种饮食前后的代谢率,可用三因素方差分析(时间、饮食、受试者)。但若两种饮食分别给不同受试者组(如临床试验),则不能使用该分析,也不能用二因素分析。(但可对两组代谢率变化进行单因素方差分析或两样本 t 检验。)更复杂设计中的一些问题见 Armitage 和 Berry(1987,章节8)。如本章介绍的许多高级方法一样,建议寻求统计学家的帮助。
For three way designs and above the same principles are involved. However, further problems may arise which are beyond the scope of this book, especially when the variables are not fully cross-classified. For example, if we measure a group of subjects' metabolic rates before and after each of two types of diet, we could analyse the data by a three way analysis of variance (with factors time, diet and subject). But if the two diets were given to different groups of subjects, as in a clinical trial, we cannot use that analysis, nor can we use a two way analysis. (We could, however, perform a one way analysis of variance, or a two sample t test, on the changes in metabolic rate in the two groups.) Some of the issues arising in more complex designs are discussed by Armitage and Berry (1987, Chapter 8). As with many of the more advanced methods introduced in this chapter, the advice of a statistician would be valuable.

多分类数据更常以无结构的方式出现,在这种情况下,我们可以使用第12.4节中描述的多元回归方法来分析数据。
More often, data from a multiple classification arise in an unstructured way, in which case we can analyse the data by multiple regression, described in section 12.4.

12.3.5 非参数双因素方差分析 12.3.5 Non-parametric two way analysis of variance

残差服从正态分布的假设无法在拟合模型之前进行评估。然而,有时可以从原始数据看出模型拟合效果不佳。特别是当各行或各列的标准差变化很大时,说明前述参数方差分析可能存在问题。
The assumption that the residuals have a Normal distribution cannot be assessed before fitting the model. Sometimes, however, it can be seen from the raw data that the model will not fit well. In particular, wide variation in the standard deviations for each row or column will suggest problems with the parametric analysis of variance just described.

存在一种非参数形式的双因素方差分析,适用于不满足参数方法假设的数据集。该方法有时称为弗里德曼双因素方差分析,纯粹用于假设检验。
There is a non-parametric form of two way analysis of variance that can be used for data sets which do not fulfil the assumptions of the parametric method. The method, which is sometimes known as Friedman's two way analysis of variance, is purely a hypothesis test.

表12.9展示了一项实验数据,比较了四种不同潜水服在模拟水下直升机逃生中的泄漏情况。四种潜水服标准差的较大变异提示应采用秩次分析。
Table 12.9 shows some data from an experiment to compare the leakage from four different types of immersion suit during simulated underwater helicopter escapes. The wide variability of the SDs for the four suits suggests that a rank analysis would be advisable.

表12.10显示了对每个受试者的四种潜水服泄漏值进行秩次排序的结果。该数据集中无并列秩次,若存在并列,则按常规计算平均秩次。
The values for the four suits are ranked for each subject as shown in Table 12.10. There are no ties in this data set, but if there are any ties we calculate average ranks in the usual way.

表12.9 模拟水下直升机逃生中潜水服泄漏量(克)(Light 等,1987)
Table 12.9 Immersion suit leakage (g) during simulated helicopter underwater escape (Light et al., 1987)

| Subject | Suit type: A | B | C | D |
|---|---|---|---|---|
| 1 | 308 | 132 | 454 | 64 |
| 2 | 102 | 526 | 0 | 28 |
| 3 | 182 | 134 | 96 | 30 |
| 4 | 268 | 324 | 264 | 90 |
| 5 | 166 | 228 | 134 | 34 |
| 6 | 332 | 296 | 458 | 6 |
| 7 | 198 | 350 | 200 | 90 |
| 8 | 28 | 274 | 16 | 24 |
| Mean | 198 | 283 | 203 | 45.7 |
| SD | 103 | 127 | 179 | 31.6 |

表12.10 表12.9数据的秩次
Table 12.10 Ranks of the data in Table 12.9

| Subject | Suit type: A | B | C | D |
|---|---|---|---|---|
| 1 | 3 | 2 | 4 | 1 |
| 2 | 3 | 4 | 1 | 2 |
| 3 | 4 | 3 | 2 | 1 |
| 4 | 3 | 4 | 2 | 1 |
| 5 | 3 | 4 | 2 | 1 |
| 6 | 3 | 2 | 4 | 1 |
| 7 | 2 | 4 | 3 | 1 |
| 8 | 3 | 4 | 1 | 2 |
| Total (R) | 24 | 27 | 19 | 10 |
| Mean rank | 3.00 | 3.38 | 2.38 | 1.25 |

该分析的进行方式类似于Kruskal-Wallis非参数单因素方差分析(详见第9.8.6节)。如果 $R_i$ 是第 i 组中秩的总和,我们有 $k$ 组(此处为潜水服类型)和 $N$ 个受试者,则计算统计量 $H$,定义为
The analysis proceeds in a similar way to the Kruskal-Wallis non-parametric one way analysis of variance (described in section 9.8.6). If $R_i$ is the sum of the ranks in the ith group, and we have $k$ groups (here types of suit) and $N$ subjects, then we calculate the statistic $H$ defined by

$$H = \frac{12}{Nk(k+1)} \sum_{i=1}^{k} \left( R_i - \frac{N(k+1)}{2} \right)^2$$

$N(k+1)/2$ 是当原假设成立且所有组相同时,$R_i$ 的期望值。该检验基于观察到的秩和围绕期望值的变异,这是一种常见的假设检验形式。在原假设下,$H$ 服从自由度为 $k-1$ 的卡方分布。计算 $H$ 还有一个更简便的公式,即
The quantity $N(k+1)/2$ is the expected value for $R_i$ if the null hypothesis is true and all groups are the same. The test is thus based on the variation of the observed sums of ranks around the expected values, a common form of hypothesis test. Under the null hypothesis $H$ has a Chi squared distribution with $k-1$ degrees of freedom. Again there is a simpler version of the formula for calculating $H$, which is

$$H = \frac{12 \sum_{i=1}^{k} R_i^2}{Nk(k+1)} - 3N(k+1)$$

该方法不适用于二维表中每个单元格有多于一个观测值的数据。它假设每组数据中不存在平秩,但对少量平秩影响不大。
This method is not suitable for data where there is more than one observation in each cell of the two way table. It assumes that there are no ties in the data for each group, but will be little affected by a few ties.

表12.10显示了每种潜水服类型的秩和。我们计算 $H$ 为:
Table 12.10 shows the sums of the ranks for each type of diving suit. We calculate $H$ as:

$$H = \frac{12 \times (24^2 + 27^2 + 19^2 + 10^2)}{8 \times 4 \times 5} - 3 \times 8 \times 5 = 132.45 - 120 = 12.45$$

利用自由度为3的卡方分布表B5,我们得到 $P < 0.01$。(精确值为0.006。)
Using Table B5 for the Chi squared distribution with three degrees of freedom we find $P < 0.01$. (The exact value is 0.006.)
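The rank-based calculation can be reproduced with a short Python sketch. The leakage values are those of Table 12.9 as read from the flattened table (subject 2 taken as 102, 526, 0 and 28, which agrees with the printed ranks and column means). The helper names are illustrative, the ranking function assumes no ties within a subject, and the P value uses a closed-form expression that holds only for three degrees of freedom.

```python
import math

# Immersion suit leakage (g), Table 12.9: rows are subjects, columns are suits A-D.
leakage = [
    [308, 132, 454, 64],
    [102, 526, 0, 28],
    [182, 134, 96, 30],
    [268, 324, 264, 90],
    [166, 228, 134, 34],
    [332, 296, 458, 6],
    [198, 350, 200, 90],
    [28, 274, 16, 24],
]


def rank_within_subject(row):
    """Ranks 1..k of the values in one row (this sketch assumes no ties)."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0] * len(row)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


ranks = [rank_within_subject(row) for row in leakage]
n_subjects, k = len(ranks), len(ranks[0])
rank_sums = [sum(r[j] for r in ranks) for j in range(k)]

# Friedman statistic: squared deviations of rank sums from their expected value.
expected = n_subjects * (k + 1) / 2
h = 12 * sum((r - expected) ** 2 for r in rank_sums) / (n_subjects * k * (k + 1))


def chi2_sf_3df(x):
    """Upper tail area of the Chi squared distribution with 3 df only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(h)
print("rank sums:", rank_sums)
print(f"H = {h:.2f}, P = {p_value:.3f}")
```

For other numbers of groups the tail area would have to come from tables or from a statistics library instead of `chi2_sf_3df`.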

与所有多于两组的比较一样,整体显著的 P 值并不指明差异具体在哪些组之间,尽管在本例中数据观察清楚显示潜水服D的漏水情况明显较少。组间配对可用Wilcoxon配对符号秩检验比较,同时需考虑多重检验的调整。但需注意,Friedman分析在两组时等价于符号检验的扩展,而非Wilcoxon检验。
As with all comparisons of more than two groups, an overall significant P value does not indicate where the differences lie, although in this case inspection of the data shows clearly that suit D is far less leaky. Pairs of groups can be compared by Wilcoxon matched pair tests, making due allowance for multiple testing. Note, however, that the Friedman analysis with two groups is equivalent to an extension of the sign test rather than the Wilcoxon test.

12.4 多元回归 12.4 MULTIPLE REGRESSION

之前章节讨论的统计分析方法均未能同时考虑两个以上的变量。然而,实际数据往往涉及多个变量。在上一节中,我展示了如何将方差分析扩展到对多个分类变量(因素)组合的单一测量进行分析。方差分析仅适用于由设计实验产生的结构化数据集。在观察性研究中,我们常关注一个变量如何受多个变量影响,但数据通常是非结构化的。本节介绍多元线性回归技术,用于分析此类数据。我们通常称该方法为多元回归。
None of the methods of statistical analysis discussed in previous chapters allows us to look at more than one or two variables at a time. Frequently, however, data are collected on many variables. In the previous section I showed how analysis of variance can be extended to situations where we have one measurement recorded for combinations of several categorical variables (factors). Analysis of variance can be used only for structured data sets, which arise from designed experiments. In observational studies we are often interested in the way one variable is influenced by several variables, but the data are unstructured. This section introduces the technique of multiple linear regression, which we use to analyse that type of data. We often refer to the method as multiple regression.

第11章主要讨论了简单线性回归,用于描述两个连续变量之间的线性关系。如12.2节所述,回归方法可扩展至根据两个或更多变量预测一个变量的值。多元回归分析得到的回归模型中,因变量(或结果变量)表示为解释变量(有时称为预测变量或协变量)的组合。正如我们将看到的,解释变量不必是连续的。
Chapter 11 dealt mainly with simple linear regression, the method we use to describe the linear relation between two continuous variables. As I noted in section 12.2, regression methods can be extended to the case where we wish to predict the value of one variable from values of two or more other variables. Multiple regression analysis yields a regression model in which the dependent (or outcome) variable is expressed as a combination of the explanatory variables (sometimes called predictor variables or covariates). As we will see, it is not necessary for the explanatory variables to be continuous.

例如,假设我们希望根据身高(单位:厘米)和体重(单位:千克)预测呼吸肌力量指标PEmax(单位:cm H₂O)。我们将得到如下回归模型:
For example, suppose we wish to predict an index of respiratory muscle strength PEmax (in cm H₂O) from height (in cm) and weight (in kg). We would obtain a regression model like the following:

$$\text{PEmax} = 47.35 + 0.147 \times \text{height} + 1.024 \times \text{weight}$$

数字0.147和1.024分别称为身高和体重的回归系数。它们表示PEmax随解释变量每增加一个单位(分别为1厘米和1千克)而预测的增加值。47.35是常数项,表示当体重和身高均为零时的PEmax值。与线性回归中的截距类似,它通常不具备实际意义。
The numbers 0.147 and 1.024 are called the regression coefficients for height and weight. They indicate the predicted increase in PEmax for each unit increase in the explanatory variable, here 1 cm and 1 kg respectively. The value of 47.35 is the constant, corresponding to PEmax when weight and height are both zero. Like the intercept in linear regression, it is not usually of great interest.
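To show where such coefficients come from, here is a minimal least-squares sketch in plain Python using the height, weight and PEmax columns of Table 12.11. It solves the normal equations with a small hand-written Gaussian elimination (the helper `solve` is written for this example, not taken from any library) and, provided the data are transcribed correctly, should give values close to the constant and coefficients quoted above. Standard errors are not computed here.

```python
# Height (cm), weight (kg) and PEmax (cm H2O) for the 25 patients in Table 12.11.
height = [109, 112, 124, 125, 127, 130, 139, 150, 146, 155, 156, 153, 160,
          158, 160, 153, 174, 176, 171, 156, 174, 178, 180, 175, 179]
weight = [13.1, 12.9, 14.1, 16.2, 21.5, 17.5, 30.7, 28.4, 25.1, 31.5, 39.9,
          42.1, 45.6, 51.2, 35.9, 34.8, 44.7, 60.1, 42.6, 37.2, 54.6, 64.0,
          73.8, 51.1, 71.5]
pemax = [95, 85, 100, 85, 95, 80, 65, 110, 70, 95, 110, 90, 100, 80, 134,
         134, 165, 120, 130, 85, 85, 160, 165, 95, 195]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Design matrix: a column of ones for the constant, then height and weight.
design = [[1.0, h, w] for h, w in zip(height, weight)]
xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * y for row, y in zip(design, pemax)) for i in range(3)]
constant, b_height, b_weight = solve(xtx, xty)


def predict(h, w):
    """Predicted PEmax for a given height (cm) and weight (kg)."""
    return constant + b_height * h + b_weight * w


print(f"PEmax = {constant:.2f} + {b_height:.3f} x height + {b_weight:.3f} x weight")
print(f"predicted PEmax at 150 cm and 40 kg: {predict(150, 40):.1f}")
```

In practice a statistics package would be used, since it also reports the standard errors and residual variance discussed next.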

分析中还会得到每个回归系数的标准误差,据此我们可以计算变量的统计显著性及回归系数的置信区间。与方差分析和线性回归一样,残差方差衡量模型对数据的拟合程度。
From the analysis we also obtain standard errors for each regression coefficient, from which we can calculate the statistical significance of a variable and a confidence interval for the regression coefficient. As with analysis of variance and linear regression, the residual variance provides a measure of how well the model fits the data.

多元回归分析适用于以下几种情况:
There are several situations in which we may wish to perform a multiple regression analysis:

1.我们希望在研究两个变量关系时,剔除其他“干扰”变量的可能影响;

1. we may wish to remove the possible effects of other 'nuisance' variables from a study of the relation between just two variables;

2.我们在探索潜在的预后变量时,几乎没有或没有关于哪些变量重要的先验信息;
2. we may be exploring possible prognostic variables with little or no prior information of which variables are important;

3.我们可能希望从多个解释变量中开发一个预测感兴趣的因变量的预后指数。
3. we may wish to develop a prognostic index from several explanatory variables for predicting the dependent variable of interest.

在实际中,区分这些可能性并不总是容易的,一次分析可能包含上述三种思想。每种情况下的分析方法相同。
In practice it is not always easy to distinguish these possibilities and one analysis may incorporate all three ideas. The method of analysis is the same in each case.

上述第一种可能性的一个例子是关于父母出生体重对婴儿出生体重影响的研究。Langhoff-Roos 等人(1987)分析了276名瑞典婴儿的数据,这些婴儿出生体重超过 2500 克,妊娠期为37-41周。初步多元回归分析仅考虑了三种“胎儿因素”—母亲出生体重、父亲出生体重和胎儿性别。母亲和父亲出生体重的回归系数分别为 0.214 克(标准误 0.062 克)和 0.122 克(标准误 0.049 克),均高度统计显著。随后,他们进行了包含孕前母体体重和身高、既往子女数、孕期体重增加及母亲吸烟情况的分析,这些变量均已知与出生体重相关。该更全面的分析旨在评估婴儿出生体重与父母出生体重之间观察到的关联是否可以通过父母出生体重与其他变量之间的微妙相互关系“解释”。例如,低出生体重的母亲可能更倾向于吸烟。
An example of the first of the above possibilities is given by a study of the effect of parental birth weight on infant birth weight. Langhoff-Roos et al. (1987) analysed data for 276 Swedish infants with birth weights exceeding 2500 g born at 37-41 weeks of gestation. An initial multiple regression analysis considered just three 'fetal factors': maternal birth weight, paternal birth weight and fetal sex. The regression coefficients for maternal and paternal birth weights were 0.214 g (SE 0.062 g) and 0.122 g (SE 0.049 g) respectively, both highly statistically significant. They then carried out an analysis incorporating maternal pre-pregnancy weight and height, number of previous children, weight gain during pregnancy and maternal smoking, all of which are known to be associated with birth weight. This larger analysis assessed whether the observed association between infant birth weight and parents' birth weight could be 'explained' by some subtle inter-relationships between parental birth weights and the additional variables. For example, it might be that mothers who had had low birth weights are more likely to smoke.

在更全面的分析中,母亲和父亲出生体重的回归系数分别为 0.187 克(标准误 0.062 克)和 0.157 克(标准误 0.047 克)。两者仍然高度显著,系数的大小变化不大。我们可以得出结论,父母与婴儿出生体重的关系不能通过其他变量的变异来解释,因此可以推断这种关联是真实存在的。鉴于数据的性质,我们也可以合理推断这种关联具有因果性。然而,如下所示,该关联较弱。与简单线性回归一样,回归系数被解释为预测变量增加一个单位时,因变量的估计增加量。在本例中,将系数乘以100更为有用,因此回归系数被解释为母亲和父亲出生体重每增加100克,婴儿出生体重分别增加19克和16克。注意,解释系数时需要知道测量单位。
The regression coefficients for maternal and paternal birth weights in the larger analysis were 0.187 g (SE 0.062 g) and 0.157 g (SE 0.047 g) respectively. Both are still highly significant and the magnitudes of the coefficients are little changed. We can conclude that the relation between parental and infant birth weights cannot be explained by variation in the other variables, and thus can infer that the association is a real one. Given the nature of the data we may reasonably also infer that the association is causal. However, the association is weak, as we shall see below. As with simple linear regression, the regression coefficients are interpreted as the estimated increase in the outcome variable for an increase of one unit in the predictor variable. In this example it is helpful to multiply by 100, so that the regression coefficients are interpreted as indicating an increase of 19 g and 16 g in infant birth weight for every extra 100 g of maternal and paternal birth weight respectively. Notice that to interpret the coefficients we need to know the units of measurement.
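The rescaling to 100 g units, and the way a standard error leads to a significance test and confidence interval, can be illustrated with the coefficients quoted for the larger analysis. This is a rough sketch only: it assumes a Normal approximation (multiplier 1.96), which seems reasonable with 276 infants but is not stated in the source, and the intervals it prints are not results reported by the original study.

```python
# Coefficient and standard error (g per g of parental birth weight),
# as quoted in the text for the larger analysis.
coefficients = {
    "maternal": (0.187, 0.062),
    "paternal": (0.157, 0.047),
}
Z_95 = 1.96  # Normal approximation for a 95% interval (an assumption here)

results = {}
for parent, (coef, se) in coefficients.items():
    results[parent] = {
        "ratio": coef / se,                    # coefficient divided by its SE
        "ci": (coef - Z_95 * se, coef + Z_95 * se),
        "per_100g": 100 * coef,                # change per extra 100 g
    }

for parent, res in results.items():
    low, high = res["ci"]
    print(f"{parent}: {res['per_100g']:.1f} g per 100 g, "
          f"coefficient/SE = {res['ratio']:.2f}, "
          f"95% CI {low:.3f} to {high:.3f}")
```

Rounding the per-100 g values gives the 19 g and 16 g mentioned in the text.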

当我们知道希望包含在模型中的变量时,多元回归相对简单。困难出现在我们希望从大量变量中识别与因变量相关的变量,并评估所得模型与数据的拟合度时。因此,我们试图在同一数据上进行探索性和验证性分析。特别是多重显著性检验的使用方式会引发问题。
Multiple regression is relatively straightforward when we know which variables we wish to have in the model. Difficulties occur when we wish to identify from a large number of variables those which are related to the dependent variable, and also assess how well the model obtained fits the data. We are thus trying to carry out exploratory and confirmatory analyses on the same data. Problems arise particularly from the way in which multiple significance testing is used.

多元回归分析将通过一项包含25名囊性纤维化患者的数据研究(O'Neill 等,1983)进行说明,其中部分数据已在表3.1中列出。表12.11展示了因变量PEmax,它是这些患者营养不良的一个指标,以及各种可能的解释变量,其中几个与体型或肺功能相关。
Multiple regression analysis will be illustrated using data from a study of 25 patients with cystic fibrosis (O'Neill et al., 1983), some of which were shown in Table 3.1. Table 12.11 shows the dependent variable, PEmax, which is a measure of malnutrition in these patients, and various possible explanatory variables, several of which relate to body size or lung function.

表12.11 25名囊性纤维化患者的数据(O'Neill 等,1983)
Table 12.11 Data for 25 patients with cystic fibrosis (O'Neill et al., 1983)

| Sub | Age | Sex | Height | Weight | BMP | FEV1 | RV | FRC | TLC | PEmax |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 0 | 109 | 13.1 | 68 | 32 | 258 | 183 | 137 | 95 |
| 2 | 7 | 1 | 112 | 12.9 | 65 | 19 | 449 | 245 | 134 | 85 |
| 3 | 8 | 0 | 124 | 14.1 | 64 | 22 | 441 | 268 | 147 | 100 |
| 4 | 8 | 1 | 125 | 16.2 | 67 | 41 | 234 | 146 | 124 | 85 |
| 5 | 8 | 0 | 127 | 21.5 | 93 | 52 | 202 | 131 | 104 | 95 |
| 6 | 9 | 0 | 130 | 17.5 | 68 | 44 | 308 | 155 | 118 | 80 |
| 7 | 11 | 1 | 139 | 30.7 | 89 | 28 | 305 | 179 | 119 | 65 |
| 8 | 12 | 1 | 150 | 28.4 | 69 | 18 | 369 | 198 | 103 | 110 |
| 9 | 12 | 0 | 146 | 25.1 | 67 | 24 | 312 | 194 | 128 | 70 |
| 10 | 13 | 1 | 155 | 31.5 | 68 | 23 | 413 | 225 | 136 | 95 |
| 11 | 13 | 0 | 156 | 39.9 | 89 | 39 | 206 | 142 | 95 | 110 |
| 12 | 14 | 1 | 153 | 42.1 | 90 | 26 | 253 | 191 | 121 | 90 |
| 13 | 14 | 0 | 160 | 45.6 | 93 | 45 | 174 | 139 | 108 | 100 |
| 14 | 15 | 1 | 158 | 51.2 | 93 | 45 | 158 | 124 | 90 | 80 |
| 15 | 16 | 1 | 160 | 35.9 | 66 | 31 | 302 | 133 | 101 | 134 |
| 16 | 17 | 1 | 153 | 34.8 | 70 | 29 | 204 | 118 | 120 | 134 |
| 17 | 17 | 0 | 174 | 44.7 | 70 | 49 | 187 | 104 | 103 | 165 |
| 18 | 17 | 1 | 176 | 60.1 | 92 | 29 | 188 | 129 | 130 | 120 |
| 19 | 17 | 0 | 171 | 42.6 | 69 | 38 | 172 | 130 | 103 | 130 |
| 20 | 19 | 1 | 156 | 37.2 | 72 | 21 | 216 | 119 | 81 | 85 |
| 21 | 19 | 0 | 174 | 54.6 | 86 | 37 | 184 | 118 | 101 | 85 |
| 22 | 20 | 0 | 178 | 64.0 | 86 | 34 | 225 | 148 | 135 | 160 |
| 23 | 23 | 0 | 180 | 73.8 | 97 | 57 | 171 | 108 | 98 | 165 |
| 24 | 23 | 0 | 175 | 51.1 | 71 | 33 | 224 | 131 | 113 | 95 |
| 25 | 23 | 0 | 179 | 71.5 | 95 | 52 | 225 | 127 | 101 | 195 |

Sub 受试者编号
Sub Subject number

性别 0 = 男性,1 = 女性
Sex 0 = male, 1 = female

BMP 体重指数(体重/身高),按年龄特定的正常个体中位数的百分比表示
BMP Body mass (Weight/Height) as a percentage of the age-specific median in normal individuals

FEV1 一秒钟用力呼气量
FEV1 Forced expiratory volume in 1 second

RV 残气量
RV Residual volume

FRC 功能残气量
FRC Functional residual capacity

TLC 总肺容量
TLC Total lung capacity

PEmax 最大静态呼气压(cm H₂O)
PEmax Maximal static expiratory pressure (cm H₂O)

12.4.1 Categorical variables

If we include in the regression model a binary variable having values 0 or 1 for each individual, for example indicating non-smokers and smokers, the regression coefficient indicates the average difference in the dependent variable between the groups defined by the binary variable, adjusted for any differences between the groups with respect to the other variables in the model. This is because the difference between the codes for the groups is one. If the model contains two explanatory variables, one of which is continuous and the other binary, then we can think of the analysis as fitting two parallel lines representing simple linear regression of the dependent variable on the continuous independent variable for each of the two groups. This analysis is known as analysis of covariance; it was also discussed briefly in section 11.12.1.

We can also deal with categorical variables that have more than two categories. For example, if we have a variable for marital status coded 1 for married, 2 for single, and 3 for divorced, widowed or separated, then if we were to put this variable in an analysis as it stands we would be imposing the unreasonable assumption that the relation was linear with the codes 1, 2 and 3. We can get round this by creating two new binary variables (often called dummy variables), for example defined as:

(1) 1 if single, 0 otherwise;
(2) 1 if divorced, widowed or separated, 0 otherwise.

For a married person both of these variables will be zero. If the variable (1) is significant then the dependent variable is significantly different between those who are married and those who are single, and similarly for (2). In general we need $k - 1$ dummy variables for $k$ categories. It is often best to fit all or none of the dummy variables to get an overall assessment of whether that categorical variable is associated with the dependent variable, but it is sometimes reasonable to consider dummy variables as separate entities.
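The dummy coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_variables` and the five example codes are hypothetical, and the first category listed (married, code 1) is assumed to be the reference group.

```python
def dummy_variables(codes, categories):
    """Return k - 1 indicator columns for a variable with k categories.

    The first entry of `categories` is treated as the reference group,
    so it gets no column of its own (all its dummies are zero).
    """
    others = categories[1:]
    return {level: [1 if code == level else 0 for code in codes]
            for level in others}


# Hypothetical marital status codes for five people:
# 1 = married, 2 = single, 3 = divorced, widowed or separated.
marital = [1, 2, 3, 2, 1]
dummies = dummy_variables(marital, categories=[1, 2, 3])

print(dummies[2])  # dummy (1): single
print(dummies[3])  # dummy (2): divorced, widowed or separated
```

Both columns are zero for the married individuals, so each regression coefficient would compare one category with the married group.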

If the categories are ordered, then we must as usual take note of this in the analysis. The above approach does not meet this requirement, but it may be reasonable to use the variable as it stands, with the codes given. For example, we may have a variable coded 1 to 4 representing progressive stages of disease. This is the same as investigating a linear trend, as was described for one way analysis of variance and the Chi squared test for trend (see section 11.15). We can also use this approach as an alternative way of dealing with continuous variables, especially when the relation with the dependent variable is clearly non-linear. We could, for example, create a new variable with codes from 1 to 5 indicating different age groups. The number of cigarettes smoked per day is often treated in this way.

12.4.2 Different approaches to choosing a model

Sometimes we know in advance which variables we wish to include in a multiple regression model. Here it is straightforward to fit a regression model containing all of those variables. The study of parental birth weight was of this type. Variables that are not significant can be omitted and the analysis redone. There is no hard rule about this, however. Sometimes it is desirable to keep a variable in a model because past experience shows that it is important. In large samples the omission of non-significant variables will have little effect on the other regression coefficients. The strategy will also depend upon the purpose of the analysis. If the aim is to identify important predictor variables then it makes sense to omit variables that do not contribute much to the model, which are usually taken to be those for which the $P$ value exceeds 0.05. I discuss these issues further in section 12.4.10.

The statistical significance of each variable in the multiple regression model is obtained simply by calculating the ratio of the regression coefficient to its standard error and relating this value to the $t$ distribution with $n - k - 1$ degrees of freedom, where $n$ is the sample size and $k$ is the number of variables in the model. The $t$ statistic, which is calculated as $b/\mathrm{se}(b)$, where $b$ is the regression coefficient, is equal to the square root of the $F$ statistic for the extra variability explained by the present model in comparison with the model excluding that particular variable. The latter approach must be used to assess the combined effect of a set of dummy variables representing a categorical variable.
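As a small check on the calculation just described, the sketch below recomputes $t = b/\mathrm{se}(b)$ and the degrees of freedom $n - k - 1$ from the coefficients printed later in Table 12.16. The dictionary name and layout are illustrative choices, and the $P$ values are not recomputed because the Python standard library has no $t$ distribution.

```python
# (coefficient b, standard error se(b), t as printed) taken from Table 12.16
TABLE_12_16 = {
    "weight": (1.536, 0.364, 4.22),
    "bmp": (-1.465, 0.579, -2.53),
    "fev1": (1.109, 0.514, 2.16),
}

n = 25                      # sample size
k = len(TABLE_12_16)        # number of variables in the model
degrees_of_freedom = n - k - 1

# t statistic for each variable: coefficient divided by its standard error
t_stats = {name: b / se for name, (b, se, _) in TABLE_12_16.items()}

for name, (_, _, printed) in TABLE_12_16.items():
    print(f"{name:6s} t = {t_stats[name]:6.2f} (table: {printed:5.2f}), "
          f"df = {degrees_of_freedom}")
```

The recomputed ratios agree with the printed $t$ values to within the rounding of the coefficients and standard errors.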

In medical research it is more common to be faced with several contenders from which we wish to obtain the model which is, in some sense, best. By 'best' we refer to the ability of the model to predict the dependent variable or, equivalently, to explain variation in that variable. There are several ways of trying to find the best model, none of which can be taken as clearly better than the rest. Some subjective assessment may be necessary, especially when different approaches yield different answers. This chapter is intended as an introduction, so the following exposition should not be taken as a comprehensive discussion of the many issues. Interpretation of multiple regression models will be discussed after the various strategies have been introduced.

12.4.3 Forward stepwise regression

The first step in many analyses of multivariate data is to examine the simple relation between each potential explanatory variable and the outcome variable of interest, ignoring all the other variables. In other words, we carry out linear regression analyses on each variable in turn. Table 12.12 summarizes these analyses for the data in Table 12.11. Five of the nine variables are significantly associated with PEmax ($P < 0.05$).

Table 12.12 Results of separately regressing PEmax on each explanatory variable

| Explanatory variable | Regression coefficient | Standard error | t | P |
|---|---|---|---|---|
| Age | 4.055 | 1.088 | 3.73 | 0.0011 |
| Sex | -19.045 | 13.176 | -1.45 | 0.16 |
| Height | 0.932 | 0.260 | 3.59 | 0.0016 |
| Weight | 1.187 | 0.301 | 3.94 | 0.0006 |
| BMP | 0.639 | 0.565 | 1.13 | 0.27 |
| FEV1 | 1.354 | 0.555 | 2.44 | 0.023 |
| RV | -0.123 | 0.077 | -1.59 | 0.12 |
| FRC | -0.319 | 0.145 | -2.20 | 0.038 |
| TLC | -0.358 | 0.404 | -0.89 | 0.38 |

Forward stepwise regression analysis uses this analysis as its starting point. The method can be broken down into a few simple steps:

(a) Find the single variable that has the strongest association with the dependent variable and enter it into the model.

The variable with strongest association is that with the most significant slope (i.e. that with the smallest $P$ value). This is equivalent to finding the variable that is most highly correlated with the dependent variable.

(b) Find the variable among those not in the model that, when added to the model so far obtained, explains the largest amount of the remaining variability.

The method for carrying out this step is given below. It is equivalent to finding the variable with the largest correlation (ignoring sign) with the residuals from the model so far.

(c) Repeat step (b) until the addition of an extra variable is not statistically significant at some chosen level such as $P < 0.05$.

We need to stop the process at some point, otherwise we will end up with all the variables in the model. As well as having an unusable model, we will have 'overfitted' the data, in a sense described in section 12.4.6. Unfortunately, the cut-off of $P = 0.05$ (or any other) is arbitrary and not directly related to how well the model fits the data.
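Steps (a) to (c) can be sketched in plain Python using the data of Table 12.11. Several things here are assumptions made for illustration rather than part of the original analysis: the function names are invented, the least squares fit is done by solving the normal equations on mean-centred variables, and the stopping rule is a fixed $F$-to-enter threshold of 4.0 standing in for $P < 0.05$ (the standard library has no $F$ distribution; with about 20 residual degrees of freedom the 5% critical value is a little above 4).

```python
# Table 12.11; columns: age, sex, height, weight, bmp, fev1, rv, frc, tlc, pemax
NAMES = ["age", "sex", "height", "weight", "bmp", "fev1", "rv", "frc", "tlc", "pemax"]
ROWS = [
    (7, 0, 109, 13.1, 68, 32, 258, 183, 137, 95),
    (7, 1, 112, 12.9, 65, 19, 449, 245, 134, 85),
    (8, 0, 124, 14.1, 64, 22, 441, 268, 147, 100),
    (8, 1, 125, 16.2, 67, 41, 234, 146, 124, 85),
    (8, 0, 127, 21.5, 93, 52, 202, 131, 104, 95),
    (9, 0, 130, 17.5, 68, 44, 308, 155, 118, 80),
    (11, 1, 139, 30.7, 89, 28, 305, 179, 119, 65),
    (12, 1, 150, 28.4, 69, 18, 369, 198, 103, 110),
    (12, 0, 146, 25.1, 67, 24, 312, 194, 128, 70),
    (13, 1, 155, 31.5, 68, 23, 413, 225, 136, 95),
    (13, 0, 156, 39.9, 89, 39, 206, 142, 95, 110),
    (14, 1, 153, 42.1, 90, 26, 253, 191, 121, 90),
    (14, 0, 160, 45.6, 93, 45, 174, 139, 108, 100),
    (15, 1, 158, 51.2, 93, 45, 158, 124, 90, 80),
    (16, 1, 160, 35.9, 66, 31, 302, 133, 101, 134),
    (17, 1, 153, 34.8, 70, 29, 204, 118, 120, 134),
    (17, 0, 174, 44.7, 70, 49, 187, 104, 103, 165),
    (18, 1, 176, 60.1, 92, 29, 188, 129, 130, 120),
    (19, 0, 171, 42.6, 69, 38, 172, 130, 103, 130),
    (19, 1, 156, 37.2, 72, 21, 216, 119, 81, 85),
    (19, 0, 174, 54.6, 86, 37, 184, 118, 101, 85),
    (20, 0, 178, 64.0, 86, 34, 225, 148, 135, 160),
    (23, 0, 180, 73.8, 97, 57, 171, 108, 98, 165),
    (23, 0, 175, 51.1, 71, 33, 224, 131, 113, 95),
    (23, 0, 179, 71.5, 95, 52, 225, 127, 101, 195),
]
DATA = {name: [float(row[j]) for row in ROWS] for j, name in enumerate(NAMES)}
CANDIDATES = NAMES[:-1]  # every column except the dependent variable


def centre(values):
    """Subtract the mean, which removes the need for an explicit intercept."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def residual_ss(predictors, response="pemax"):
    """Residual sum of squares for least squares regression with an intercept."""
    y = centre(DATA[response])
    cols = [centre(DATA[name]) for name in predictors]
    k = len(cols)
    # Augmented normal equations [X'X | X'y] for the centred variables.
    a = [[sum(u * v for u, v in zip(cols[r], cols[s])) for s in range(k)]
         + [sum(u * v for u, v in zip(cols[r], y))] for r in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [x - factor * z for x, z in zip(a[r], a[i])]
    b = [a[r][k] / a[r][r] for r in range(k)]
    fitted = [sum(bj * col[i] for bj, col in zip(b, cols)) for i in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def forward_stepwise(candidates, f_to_enter=4.0):
    """Add the variable with the largest F-to-enter until none reaches the threshold."""
    n = len(ROWS)
    model, steps = [], []
    rss = residual_ss(model)  # with no variables this is the total sum of squares
    while len(model) < len(candidates):
        f_values = {}
        for name in candidates:
            if name not in model:
                new_rss = residual_ss(model + [name])
                resid_df = n - len(model) - 2  # n - k - 1 once the variable is in
                f_values[name] = (rss - new_rss) / (new_rss / resid_df)
        best = max(f_values, key=f_values.get)
        steps.append((best, round(f_values[best], 2)))
        if f_values[best] < f_to_enter:
            break  # best remaining variable does not qualify: stop
        model.append(best)
        rss = residual_ss(model)
    return model, steps


model, steps = forward_stepwise(CANDIDATES)
print("selected:", model)
print("best candidate and F at each step:", steps)
```

Each entry of `steps` records the best candidate at that step and its $F$ value, so the point at which the procedure stops can be read off directly. A lower `f_to_enter` plays the role of a laxer significance level.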

We will see how the stepwise procedure works by finding a model to predict PEmax using the data in Table 12.11. Note first that for the purposes of the first step we do not need to perform nine separate regression analyses (Table 12.12), but can get the same information from looking at the correlation matrix shown in Table 12.13. It is useful to look at the correlation matrix anyway, because it also shows the correlations among the explanatory variables. For this data set, there are many large correlation coefficients: from Table B7, $r = 0.40$ corresponds to $P = 0.05$. Figure 12.2 shows a graphical representation of the correlation matrix, with each small panel showing the relevant scatter diagram. We can see that there are no obvious outliers in the data, but the distribution of body mass percentage (BMP) is rather odd. Tables 12.12 and 12.13 both show that the most predictive single variable is weight. Table 12.14 shows the analysis of variance table for this linear regression analysis.

Table 12.13 Correlation matrix for PEmax and nine potential explanatory variables

| | PEmax | Age | Sex | Height | Weight | BMP | FEV1 | RV | FRC |
|---|---|---|---|---|---|---|---|---|---|
| Age | 0.613 | | | | | | | | |
| Sex | -0.289 | -0.167 | | | | | | | |
| Height | 0.599 | 0.926 | -0.168 | | | | | | |
| Weight | 0.635 | 0.906 | -0.190 | 0.921 | | | | | |
| BMP | 0.230 | 0.378 | -0.138 | 0.441 | 0.673 | | | | |
| FEV1 | 0.453 | 0.294 | -0.528 | 0.317 | 0.449 | 0.546 | | | |
| RV | -0.316 | -0.552 | 0.271 | -0.570 | -0.622 | -0.582 | -0.666 | | |
| FRC | -0.417 | -0.639 | 0.184 | -0.624 | -0.617 | -0.434 | -0.665 | 0.911 | |
| TLC | -0.182 | -0.469 | 0.024 | -0.457 | -0.418 | -0.365 | -0.443 | 0.589 | 0.704 |

Figure 12.2 Scatter diagrams corresponding to Table 12.13.
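The link between the correlation matrix and the separate regressions can be checked numerically. The sketch below uses the standard relation $t = r\sqrt{n-2}/\sqrt{1-r^2}$ for simple linear regression, which is not spelled out in the text, and applies it to the first column of Table 12.13; the variable names are illustrative, and agreement with Table 12.12 is only approximate because the correlations are rounded to three decimals.

```python
import math

n = 25  # number of patients

# Correlation of each explanatory variable with PEmax (Table 12.13, first column)
R_WITH_PEMAX = {
    "age": 0.613, "sex": -0.289, "height": 0.599, "weight": 0.635, "bmp": 0.230,
    "fev1": 0.453, "rv": -0.316, "frc": -0.417, "tlc": -0.182,
}

# t values printed in Table 12.12, for comparison
T_TABLE_12_12 = {
    "age": 3.73, "sex": -1.45, "height": 3.59, "weight": 3.94, "bmp": 1.13,
    "fev1": 2.44, "rv": -1.59, "frc": -2.20, "tlc": -0.89,
}


def t_from_r(r, n):
    """t statistic for the slope in simple linear regression, from the correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


t_values = {name: t_from_r(r, n) for name, r in R_WITH_PEMAX.items()}

# Step (a) of forward stepwise regression: the largest correlation, ignoring sign
strongest = max(R_WITH_PEMAX, key=lambda name: abs(R_WITH_PEMAX[name]))

for name in R_WITH_PEMAX:
    print(f"{name:6s} r = {R_WITH_PEMAX[name]:6.3f}  t = {t_values[name]:6.2f}  "
          f"(Table 12.12: {T_TABLE_12_12[name]:5.2f})")
print("strongest single predictor:", strongest)
```

Because $|t|$ increases with $|r|$ for fixed $n$, ranking by correlation and ranking by $P$ value pick out the same first variable.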

Table 12.14 Regression analysis of PEmax on weight

| Source of variation | Degrees of freedom | Sum of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression on weight | 1 | 10 827.16 | 10 827.16 | 15.56 | 0.0006 |
| Residual | 23 | 16 005.48 | 695.89 | | |
| Total | 24 | 26 832.64 | | | |

Residual SD = √695.89 = 26.38

The best variable to include with weight turns out to be BMP. The regression model is shown in Table 12.15, in the usual style of presenting a multiple regression model, together with the analysis of variance table. The $t$ test for each variable indicates whether omitting that variable would lead to a significant loss of information. It is equivalent to the square root of the $F$ value associated with the improvement in how well the model fits the data compared with the model without that variable. Thus the $t$ test for BMP in the top half of Table 12.15 is exactly equivalent to the $F$ test in the lower part. We can see that the additional effect of BMP over that achieved by including only weight in the model is not statistically significant at the 5% level, but is significant at the 10% level. If, as is usual, the 5% or 1% level is used, we would conclude that the 'best' model is that including just weight. If we were using a 10% level for including variables, we would add BMP to the model and continue the analysis. The choice of significance level is discussed below.

Table 12.15 Regression analysis of PEmax on weight and BMP

| Variable | Coefficient b | Standard error se(b) | t | P |
|---|---|---|---|---|
| Constant | 124.830 | 37.479 | | |
| Weight | 1.640 | 0.390 | 4.21 | 0.0004 |
| BMP | -1.005 | 0.581 | -1.73 | 0.10 |

| Source of variation | Degrees of freedom | Sum of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression on weight | 1 | 10 827.16 | 10 827.16 | 15.56 | 0.0006 |
| Addition of BMP | 1 | 1 914.94 | 1 914.94 | 2.99 | 0.10 |
| Residual | 22 | 14 090.54 | 640.48 | | |
| Total | 24 | 26 832.64 | | | |

Residual SD = √640.48 = 25.31
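The equivalence between the $t$ test and the $F$ test for adding BMP can be verified from the printed sums of squares. The following is a minimal arithmetic sketch using only numbers from Tables 12.14 and 12.15; the variable names are illustrative.

```python
import math

rss_weight = 16005.48        # residual sum of squares, weight only (Table 12.14)
rss_weight_bmp = 14090.54    # residual sum of squares, weight and BMP (Table 12.15)
residual_df = 22             # n - k - 1 = 25 - 2 - 1

# Extra sum of squares explained by adding BMP (1 degree of freedom)
extra_ss = rss_weight - rss_weight_bmp

# F statistic: extra mean square divided by the residual mean square
f_stat = extra_ss / (rss_weight_bmp / residual_df)

# t statistic for BMP from the top half of Table 12.15
t_bmp = -1.005 / 0.581

print(f"extra sum of squares = {extra_ss:.2f}")
print(f"F = {f_stat:.2f}, sqrt(F) = {math.sqrt(f_stat):.2f}, |t| = {abs(t_bmp):.2f}")
```

The square root of $F$ and the absolute value of $t$ agree to two decimal places, the small remaining difference coming from rounding in the printed coefficient and standard error.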

Forward stepwise regression is available as an option in some of the larger statistical packages. It can be carried out using any program for multiple regression, most simply by calculating the correlations between the residuals from the model so far obtained and all those variables not so far included in the model.

12.4.4 Backward stepwise regression

As its name implies, with the backward stepwise method we approach the problem from the other direction. The argument is put forward that we have collected data on these variables because we believe them to be potentially important explanatory variables. Therefore we should fit the full model, including all of these variables, and then remove unimportant variables one at a time until all those remaining in the model contribute significantly. We use the same criterion, say $P < 0.05$, to determine significance. At each step we remove the variable with the smallest contribution to the model (or the largest $P$ value) as long as that $P$ value is greater than the chosen level.

The final backward stepwise model obtained in this way includes weight, BMP and FEV1, as shown in Table 12.16, together with the analysis of variance table for the three variable model.

Table 12.16 Backward stepwise regression model to predict PEmax

| Variable | Coefficient b | Standard error se(b) | t | P |
|---|---|---|---|---|
| Constant | 126.334 | 34.720 | | |
| Weight | 1.536 | 0.364 | 4.22 | 0.0004 |
| BMP | -1.465 | 0.579 | -2.53 | 0.019 |
| FEV1 | 1.109 | 0.514 | 2.16 | 0.043 |

Analysis of variance:

| Source of variation | Degrees of freedom | Sum of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression | 3 | 15 294.46 | 5098.15 | 9.28 | 0.0004 |
| Residual | 21 | 11 538.18 | 549.44 | | |

Residual SD = √549.44 = 23.44

For this data set, when we use the 5% significance level as the criterion for inclusion of a variable in the model we get different models by the forward and backward stepwise approaches. The two methods often yield the same model, but differences are not uncommon. Neither approach is more correct than the other. In this case, we might choose the larger model as it includes three variables all significant at the 5% level. On the other hand, it is peculiar to include both weight and BMP in the model, and the negative coefficient for BMP (which is positively correlated with PEmax) might suggest a degree of overfitting. This example shows that $P$ values alone cannot choose an appropriate model.

12.4.5 All subsets regression

A third approach to selecting the 'best' model is to examine every possible model. It is easy to compare all models including the same number of variables by their $R^2$ statistics (see below), although we may wish to impose a condition that every variable in the model should be statistically significant at some pre-chosen level. Comparing models with different numbers of variables is more difficult, as we expect $R^2$ to increase as we continue to add more variables. One solution is to use a statistic called Mallows' $C_p$, which incorporates a penalty for each additional variable in the model. Using this method for the illustrative data set yields the same model as the backward stepwise approach, shown in Table 12.16. All subsets regression is not widely used, partly because it requires much more computing.
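To give a feel for the computing involved, the sketch below simply enumerates every non-empty subset of the nine candidate variables from Table 12.11. It does not fit any models; the variable list and names are written out here for illustration only.

```python
from itertools import combinations

VARIABLES = ["age", "sex", "height", "weight", "bmp", "fev1", "rv", "frc", "tlc"]

# Every non-empty subset of the candidates is one model to be fitted.
subsets = [subset
           for size in range(1, len(VARIABLES) + 1)
           for subset in combinations(VARIABLES, size)]

# Number of candidate models of each size
models_by_size = {size: sum(1 for s in subsets if len(s) == size)
                  for size in range(1, len(VARIABLES) + 1)}

print("total models:", len(subsets))
print("models by number of variables:", models_by_size)
```

With $k$ candidate variables there are $2^k - 1$ such models, so the count doubles with each additional candidate, whereas a stepwise search fits far fewer models.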

12.4.6 Goodness-of-fit

We can assess how well a model 'fits' the data or, equivalently, how well the model predicts the dependent variable, by considering the proportion of the total sum of squares that can be explained by the regression. For example, in Table 12.16 the sum of squares due to the model is 15294.46, so that the proportion of the variation explained is 15294.46/26832.64 = 0.57. This statistic is called $R^2$, and is often expressed as a percentage, here 57.0%.

Even when none of the possibly explanatory variables is related to the dependent variable, the expected value of $R^2$ will increase as more variables are added to the model. We thus cannot use $R^2$ as a criterion for deciding which variables should be in the model, as we would end up with all the variables. This full model might fit the observed data almost exactly, yet may be a worse predictor of the relation in the population than a model with fewer variables. Some programs produce an adjusted $R^2$, which compensates for the expected chance prediction when the null hypothesis is true, and is thus more appropriate. The adjusted $R^2$ for the model in Table 12.16 is 50.9%. Unlike $R^2$, adjusted $R^2$ can drop when a variable is added to the model.
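Both statistics can be recomputed from the sums of squares in Table 12.16. The sketch below assumes the usual definition of adjusted $R^2$, namely one minus the ratio of the residual mean square to the total mean square; that formula is not given in the text, and programs may differ in detail.

```python
n = 25                  # sample size
k = 3                   # variables in the model of Table 12.16
model_ss = 15294.46     # sum of squares due to the regression
residual_ss = 11538.18  # residual sum of squares
total_ss = model_ss + residual_ss

# Proportion of the total variation explained by the model
r_squared = model_ss / total_ss

# Adjusted R-squared (assumed definition): compares mean squares, so it
# penalises each extra variable through the residual degrees of freedom.
adjusted_r_squared = 1 - (residual_ss / (n - k - 1)) / (total_ss / (n - 1))

print(f"R-squared          = {100 * r_squared:.1f}%")
print(f"adjusted R-squared = {100 * adjusted_r_squared:.1f}%")
```

Because the penalty depends on $n - k - 1$, adding a variable that reduces the residual sum of squares only slightly raises the residual mean square and so lowers the adjusted value.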

When we perform linear regression, $R^2$ is exactly the same as $r^2$, the square of the Pearson correlation coefficient. For multiple regression models, the value of $R$ is called the multiple correlation coefficient by analogy, but it must not be interpreted in the same way. The $F$ test is the only way to assess whether a model explains a significant proportion of variability; using tables of $r$ to assess the significance of $R$ is completely invalid and wildly misleading.

$R^2$ assesses crudely how well the model fits the data overall, but we should also examine how well the model predicts values of the dependent variable for individuals. In other words we should study the residuals.

12.4.7 Analysis of residuals

The residual standard deviation is a measure of the average difference between the observed values $y_{\text{obs}}$ and those predicted or fitted by the model. The multiple regression model can be written

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

where $b_0$ is the intercept; $b_1$, $b_2$, etc. are the regression coefficients; $X_1$, $X_2$, etc. are the individual's values of the variables in the model; and $Y$ is the fitted or predicted value. The residuals are given by $y_{\text{obs}} - Y$, where $y_{\text{obs}}$ is the observed value of the dependent variable. We cannot plot the original multi-dimensional data, but we can examine plots of the residuals to see if the model is reasonable. Specifically, we ought to check that the residuals have a Normal distribution and that the model is an equally good fit throughout the range of values of the dependent variable.

As with linear regression (section 11.10) several plots are possible:

  1. We can produce a Normal plot of the residuals, to check the overall fit and verify that the residuals have an approximately Normal distribution. The Normal plot may identify outliers for further investigation. Such observations may have unremarkable values of all the variables, but a peculiar combination of them.

  2. We can plot the residuals against each of the explanatory variables in turn. We expect to see no association if the true relation is linear. As with simple linear regression, a curved pattern indicates that transformation or a non-linear term may be required.

  3. We can plot the residuals against the observed values $y_{\text{obs}}$, but this plot will necessarily show a strong correlation and will not be very helpful. The correlation does not indicate lack of fit.

  4. More usefully, we can plot the residuals against the fitted values. No pattern should be discernible. In particular, the variability of the residuals should be constant across the range of the fitted values.

The Normal plot for the residuals from the three variable model for PEmax is very straight (Figure 12.3), and provides no reason to question the validity of the analysis.
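The correlation between residuals and observed values mentioned in the list above is built into least squares rather than being a sign of poor fit. For a model fitted with an intercept, the size of that correlation equals $\sqrt{1 - R^2}$; this is a standard identity and is not stated in the text. The sketch below evaluates it for the model of Table 12.16.

```python
import math

model_ss = 15294.46   # regression sum of squares, Table 12.16
total_ss = 26832.64   # total sum of squares

r_squared = model_ss / total_ss

# Size of the correlation between residuals and observed PEmax values
# that any least squares fit with this R-squared must show.
correlation_size = math.sqrt(1 - r_squared)

print(f"R-squared = {r_squared:.3f}")
print(f"size of correlation between residuals and observed values = "
      f"{correlation_size:.3f}")
```

Only a model with $R^2$ close to 1 would make this correlation small, which is why the plot against fitted values is the more informative diagnostic.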


Figure 12.3 Normal plot of residuals from regression model in Table 12.16.

12.4.8 Prognostic index

We can use the multiple regression equation to obtain a predicted value of the dependent variable for any individual with cystic fibrosis. For example, using the model in Table 12.16 the predicted PEmax for an individual is:

$$\text{PEmax} = 126.334 + 1.536 \times \text{weight} - 1.465 \times \text{BMP} + 1.109 \times \text{FEV1}$$
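As a worked illustration, the coefficients of Table 12.16 can be applied to the first patient in Table 12.11 (weight 13.1, BMP 68, FEV1 32, observed PEmax 95). The function name `predicted_pemax` is an illustrative choice.

```python
# Coefficients from Table 12.16
COEFFICIENTS = {"constant": 126.334, "weight": 1.536, "bmp": -1.465, "fev1": 1.109}


def predicted_pemax(weight, bmp, fev1):
    """Fitted PEmax from the three variable model of Table 12.16."""
    return (COEFFICIENTS["constant"]
            + COEFFICIENTS["weight"] * weight
            + COEFFICIENTS["bmp"] * bmp
            + COEFFICIENTS["fev1"] * fev1)


# Subject 1 of Table 12.11
observed = 95
fitted = predicted_pemax(weight=13.1, bmp=68, fev1=32)
residual = observed - fitted  # observed minus fitted

print(f"fitted PEmax = {fitted:.2f}, residual = {residual:.2f}")
```

The residual for this patient is about half the residual standard deviation of 23.44, so the observation is unremarkable under this model.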

Another way of thinking of the predicted value, $Y$, is as a prognostic value or prognostic index. If the model explains a high proportion of the variability in the dependent variable, high and low predicted values will indicate widely differing prognoses. This terminology is more commonly used in connection with logistic regression (section 12.5) and regression models for analysing survival data (Chapter 13).

Note that unlike the case for linear regression, it is difficult to calculate the standard error of $Y$ because it depends upon the distance of each of the predictor variables from its mean and also the interrelations between the variables. Some statistical packages can perform these calculations, however, so that a confidence interval can be obtained.

12.4.9 Relation to partial correlation

In section 11.5 I described the calculation of the partial correlation coefficient to examine the relation between two variables after adjusting for the effect of a third variable. I noted that it is more usual to use multiple regression for this type of problem. In fact, the two analyses are exactly equivalent.

The illustrative example was based on data in Table 11.2. The partial correlation between blood viscosity and fibrinogen adjusted for haematocrit (PCV) was 0.212. Table 12.17 shows analysis of variance tables for linear regression of blood viscosity on PCV, and multiple regression with fibrinogen added to the model. The proportion of the residual sum of squares from the first model that is explained by adding fibrinogen is

$$\frac{2.7209 - 2.5982}{2.7209} = 0.0451$$

which is equal to $0.212^2$; it is the square of the partial correlation coefficient. The hypothesis of no relation between fibrinogen and blood viscosity after adjusting for PCV gives $P = 0.25$ by either approach. The multiple regression approach is more informative as we have an estimated regression coefficient and can examine the residuals.

Table 12.17 Regression analyses of blood viscosity in Table 11.2

(a) Regression of blood viscosity on haematocrit (PCV)

| Source of variation | Degrees of freedom | Sum of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression on PCV | 1 | 9.2295 | 9.2295 | 101.8 | < 0.001 |
| Residual | 30 | 2.7209 | 0.0907 | | |
| Total | 31 | 11.9504 | | | |

(b) Regression of blood viscosity on PCV and fibrinogen

| Source of variation | Degrees of freedom | Sum of squares | Mean squares | F | P |
|---|---|---|---|---|---|
| Regression on PCV | 1 | 9.2295 | 9.2295 | 103.0 | < 0.001 |
| Addition of fibrinogen | 1 | 0.1227 | 0.1227 | 1.37 | 0.25 |
| Residual | 29 | 2.5982 | 0.0896 | | |
| Total | 31 | 11.9504 | | | |
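The equivalence can be checked directly from the residual sums of squares in Table 12.17. This is a minimal arithmetic sketch with illustrative variable names; it recovers the size of the partial correlation and the $F$ statistic for adding fibrinogen.

```python
import math

rss_pcv = 2.7209             # residual SS, viscosity on PCV only (Table 12.17a)
rss_pcv_fibrinogen = 2.5982  # residual SS after adding fibrinogen (Table 12.17b)
residual_df = 29             # residual degrees of freedom in the larger model

# Proportion of the first model's residual variation explained by fibrinogen
proportion = (rss_pcv - rss_pcv_fibrinogen) / rss_pcv

# Its square root is the size of the partial correlation coefficient
partial_r = math.sqrt(proportion)

# F statistic for the addition of fibrinogen
f_stat = (rss_pcv - rss_pcv_fibrinogen) / (rss_pcv_fibrinogen / residual_df)

print(f"proportion explained = {proportion:.4f}")
print(f"partial correlation (size) = {partial_r:.3f}")
print(f"F for adding fibrinogen = {f_stat:.2f}")
```

The square root gives only the size of the partial correlation; its sign has to be taken from the regression coefficient for fibrinogen.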

12.4.10 Comments

It is not possible here to discuss in detail many of the important issues that affect multiple regression analysis and its interpretation, but the following brief comments indicate areas of interest or difficulty.

When there is a large number of potential explanatory variables we expect some of them to be significant just by chance. There is no completely satisfactory way of searching for the most suitable model without incurring the penalty of an over-optimistic answer. With many candidates for inclusion in the model, some researchers use the results of univariate analyses to decide which variables should be explored in the multivariate analysis. This strategy saves nothing with forward stepwise regression, but may dramatically cut computing time (and costs) for backward stepwise or all subsets regression. I do not recommend preselection, but if it is used, selection should be based on a lax criterion, with a $P$ value cut-off well above 0.05, because variables may contribute to a multiple regression model in unforeseen ways due to complex interrelationships among the variables. As an example, the cystic fibrosis data set gave $P = 0.27$ for BMP on its own (Table 12.12), but $P = 0.019$ for the same variable in the multiple regression model (Table 12.16).

Because of the multiple testing at each step, a model derived by stepwise (or all subsets) regression is likely to be over-optimistic with respect to the importance of each variable and the goodness-of-fit, particularly in small samples. Where the number of variables being considered is large and the sample size is small, it is often possible to find a model that appears to fit remarkably well. However, a model containing, say, seven variables fitted to 18 observations will be extremely unreliable. One solution is to suggest that multiple regression should not be applied to small data sets. In addition, it is useful to decide in advance the maximum size of model that is acceptable. I have found the square root of the sample size a useful rule of thumb here, but even that may be over-generous. Alternatively, it is sometimes suggested that the number of variables examined should be restricted. Again there is no rule, but a guideline might be to look at no more than $n/10$ variables, where $n$ is the sample size. With this approach, the illustrative analysis of the data in Table 12.11 would not be acceptable, and nor would many published multiple regression analyses.

When the sample is very large statistical significance can be achieved for tiny effects. For example, Rantakallio and Mäkinen (1984) fitted a model to data from 9795 infants on the number of teeth at one year of age. Six of the 15 variables were statistically significant, one being the sex of the child. The regression coefficient for sex indicated a mean difference of one-twentieth of a tooth in favour of boys, and the value of $R^2$ for this model was very low.

Automatic procedures for selecting a model are useful, but a degree of common sense is required. For example, sometimes there is an accumulation of evidence that a particular variable is prognostically important for the outcome being analysed. It is not sensible to omit, say, age or smoking in such circumstances because $P$ was 'only' 0.07.

A definite advantage of using automatic selection can be seen when independent variables are highly correlated. Table 12.12 shows that both height and weight are highly correlated with PEmax. If we put weight and height in the model together, however, something strange happens. Table 12.18 shows the model with just height and weight. Neither variable appears to contribute significantly, yet the model explains about 40% of the variability of PEmax. The reason is that height and weight are very highly correlated ($r = 0.921$ in Table 12.13) and thus explain much the same variability in PEmax. The value of $R^2$ is 40.4% for weight alone, about 36% for height alone, and barely higher than 40.4% for weight and height together. In fact, adding height gains us nothing and obscures the effect of weight by reducing its regression coefficient and increasing its standard error. It is a major advantage of stepwise regression that this type of misleading finding cannot occur.
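How little height adds to weight can be seen from the correlation matrix alone. The sketch below uses the standard formula for $R^2$ with two explanatory variables in terms of pairwise correlations, which is not given in the text. The inputs are the rounded correlations from Table 12.13, so the results are approximate.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for a regression on two variables, from pairwise correlations.

    r_y1, r_y2: correlations of each explanatory variable with the outcome
    r_12: correlation between the two explanatory variables
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r_pemax_weight = 0.635    # Table 12.13
r_pemax_height = 0.599    # Table 12.13
r_weight_height = 0.921   # Table 12.13

r2_weight = r_pemax_weight ** 2
r2_height = r_pemax_height ** 2
r2_both = r_squared_two_predictors(r_pemax_weight, r_pemax_height, r_weight_height)

print(f"weight alone:      {100 * r2_weight:.1f}%")
print(f"height alone:      {100 * r2_height:.1f}%")
print(f"weight and height: {100 * r2_both:.1f}%")
```

The denominator $1 - r_{12}^2$ is small when the two explanatory variables are highly correlated, which is also why the standard errors in Table 12.18 are so much larger than in the separate regressions.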

Table 12.18 Regression of PEmax on weight and height

| Variable | Coefficient b | Standard error se(b) | t | P |
|---|---|---|---|---|
| Constant | 47.355 | 73.462 | | |
| Weight | 1.024 | 0.787 | 1.30 | 0.21 |
| Height | 0.147 | 0.655 | 0.22 | 0.82 |

The multiple regression model incorporates some subtle unstated assumptions. Firstly, it is assumed that the relation between the dependent variable and each continuous explanatory variable is linear. We can examine this assumption for any variable by plotting the residuals against that variable. Any curvature in the pattern indicates that a non-linear relation is more appropriate; transformation of the explanatory variable may help here. Secondly, it is assumed that the effects of each variable are independent, so that the effect of one variable is the same regardless of the values of the other variables in the model. If we suspect, for example, that the relation between height and lung function may be different for males and females, then we need to consider adding an interaction term to the model.

Note that interaction is a quite different concept from the correlation between two variables. The interaction between two variables (continuous or binary) is examined by creating a new variable which is their product and adding this to the model. The effect is tested by the improvement in fit that results. The new variable makes the contribution of each variable to the prediction dependent upon the value of the other variable. I do not recommend the investigation of all interactions, which would greatly increase the risk of a spurious finding. Occasionally, however, a particular interaction may be of prior interest.
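The construction of an interaction variable can be sketched in a few lines. This is only an illustration: the data values, the variable names and the coefficients passed to the functions are hypothetical and do not come from any study in this chapter.

```python
# Hypothetical coded data: sex (0 = female, 1 = male) and height in cm.
sex = [0, 0, 1, 1]
height = [160.0, 172.0, 168.0, 181.0]

# The interaction variable is simply the product of the two variables.
sex_by_height = [s * h for s, h in zip(sex, height)]


def predict(b0, b_sex, b_height, b_inter, s, h):
    """Prediction from a model with both main effects and their product."""
    return b0 + b_sex * s + b_height * h + b_inter * s * h


def height_slope(b_height, b_inter, s):
    """Slope for height: b_height when s = 0, b_height + b_inter when s = 1."""
    return b_height + b_inter * s
```

With the product term in the model, the slope for height is no longer a single number but depends on sex, which is what is meant by the contribution of one variable depending on the value of the other.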

The question of how well the model fits the data was discussed in section 12.4.6. The statistics $R^2$ and adjusted $R^2$ are one way of assessing goodness of fit, but they are measures of the correlation between the observed and predicted values of the dependent variable. We cannot get any idea of the accuracy of prediction for an individual from the significance of variables nor from $R^2$, however large it is. As with ordinary linear regression, the residual standard deviation gives a measure of the discrepancies between the observed and predicted values, from which a prediction or confidence interval can be obtained.

Lastly, because of the risk that the model may be over-optimistic, it is desirable to assess the predictive capability of a model on a new, independent set of data, but this is not usually possible.

12.4.11 Presentation of results

When reporting the results of multiple regression analysis, details should be given about the strategy adopted (such as forward stepwise regression) and all the variables which were included in the analysis, not just those in the final model. For categorical variables, especially those featuring in models that are described, it is essential to explain the coding used. For example, there are numerous ways of categorizing the number of cigarettes smoked daily.

For each model described in detail the regression coefficients and their standard errors should be given. The residual standard deviation should be given, and $R^2$ or, preferably, adjusted $R^2$ may be useful too.

12.5 LOGISTIC REGRESSION

The preceding section dealt with multiple regression with a continuous dependent variable, extending the methods of linear regression introduced in Chapter 11. In many studies the outcome variable of interest is the presence or absence of some condition, such as responding to treatment or having a myocardial infarction. We cannot use ordinary multiple (linear) regression for such data, but instead we can use a similar approach known as multiple linear logistic regression or just logistic regression.

The basic principle of logistic regression is much the same as for ordinary multiple regression. The main difference is that instead of developing a model that uses a combination of the values of a group of explanatory variables to predict the value of a dependent variable, we predict a transformation of the dependent variable.

Before explaining the method it is useful to recall that if we have a binary variable and give the categories numerical values of 0 and 1, usually representing 'No' and 'Yes' respectively, then the mean of these values in a sample of individuals is the same as the proportion of individuals with the characteristic. We might expect, therefore, that the appropriate regression model would predict the proportion of subjects with the feature of interest (or, equivalently, the probability of an individual having that characteristic) for any combination of the explanatory variables in the model. In practice a statistically preferable method is to use a transformation of this proportion, as described below. One reason is that otherwise we might predict impossible probabilities outside the range 0 to 1.

The transformation we use is called the logit transformation, written $\text{logit}(p)$. Here $p$ is the proportion of individuals with the characteristic. For example, if $p$ is the probability of a subject having a myocardial infarction, then $1 - p$ is the probability that they do not have one. The ratio $p/(1 - p)$ is called the odds and thus

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$

is the log odds. If, from our model, we wish to compare predictions for subjects with or without a particular characteristic, such as age greater than 50, we will estimate $p_1$ for one group of subjects and $p_2$ for the other. Then we have

$$\text{logit}(p_1) - \text{logit}(p_2) = \log\left(\frac{p_1}{1 - p_1}\right) - \log\left(\frac{p_2}{1 - p_2}\right) = \log\left(\frac{p_1(1 - p_2)}{p_2(1 - p_1)}\right)$$

which is the log of the odds ratio. As described in section 10.11.2, the odds ratio is an important method for relating disease to exposure in epidemiological studies. The estimated value of $p$ can be derived from $\text{logit}(p)$, and always lies in the range 0 to 1. If $\text{logit}(p) = L$, then we have $p/(1 - p) = e^L$ and thus $p = e^L/(1 + e^L)$.
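The logit transformation and its inverse are easy to check numerically. The following is a minimal sketch using natural logarithms; the function names are chosen here for illustration only.

```python
import math


def logit(p):
    """Log odds of a proportion p, where 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(value):
    """Recover p from a logit; the result always lies between 0 and 1."""
    return math.exp(value) / (1 + math.exp(value))


def log_odds_ratio(p1, p2):
    """Difference of two logits, which equals the log of the odds ratio."""
    return logit(p1) - logit(p2)
```

For example, proportions of 0.5 and 0.2 have odds of 1 and 0.25, so `math.exp(log_odds_ratio(0.5, 0.2))` gives an odds ratio of 4, and `inv_logit(logit(0.3))` returns 0.3 apart from rounding error.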

Table 12.19 summarizes some data relating hypertension to smoking, obesity and snoring in 433 men aged 40 or over. We can use logistic regression to see which of the factors smoking, obesity and snoring are predictive of hypertension. The full model is shown in Table 12.20(a). The significance of each variable can be assessed by treating $z = b/\text{se}(b)$ as a standard Normal deviate; the P values are shown in the table. Clearly smoking has no association with hypertension, but both obesity and snoring seem to be independently prognostic. Omission of smoking (Table 12.20b) makes a minimal difference to the other coefficients.

Table 12.19 Hypertension in men aged 40 or over in relation to smoking, obesity and snoring (Norton and Dunn, 1985)

| Smoking* | Obesity* | Snoring* | Number of men | Number (%) of men with hypertension |
|---|---|---|---|---|
| 0 | 0 | 0 | 60 | 5 (8%) |
| 1 | 0 | 0 | 17 | 2 (11%) |
| 0 | 1 | 0 | 8 | 1 (13%) |
| 1 | 1 | 0 | 2 | 0 (0%) |
| 0 | 0 | 1 | 187 | 35 (19%) |
| 1 | 0 | 1 | 85 | 13 (15%) |
| 0 | 1 | 1 | 51 | 15 (29%) |
| 1 | 1 | 1 | 23 | 8 (35%) |
| Total | | | 433 | 79 (18%) |

*Codes are 0 for No, 1 for Yes

Table 12.20 Logistic regression analysis of the hypertension data in Table 12.19

(a) All variables

| | Regression coefficient b | Standard error se(b) | z | P |
|---|---|---|---|---|
| Constant | -2.378 | 0.380 | | |
| Smoking (x1) | -0.068 | 0.278 | 0.24 | 0.81 |
| Obesity (x2) | 0.695 | 0.285 | 2.44 | 0.015 |
| Snoring (x3) | 0.872 | 0.398 | 2.19 | 0.028 |

(b) Omitting smoking

| | Regression coefficient b | Standard error se(b) | z | P |
|---|---|---|---|---|
| Constant | -2.392 | 0.376 | | |
| Obesity (x1) | 0.695 | 0.285 | 2.44 | 0.015 |
| Snoring (x2) | 0.866 | 0.397 | 2.18 | 0.029 |

The analyses presented relate only to the main effects of obesity, smoking and snoring. Ideally we should also investigate the possibility that there may be an important interaction between two of these factors, for example that the effect of smoking is different for snorers and non-snorers. We can do this very simply if we have coded the binary variables as 0 or 1, by creating a new variable that is the product of the two variables that we are interested in. So we can create a new variable by multiplying together the values of smoking and snoring, and add this variable to the model. In fact, in this data set neither this nor any other interaction term is anywhere near to statistical significance.

The regression equation for the model with three variables is

$$\text{logit}(p) = -2.378 - 0.068x_1 + 0.695x_2 + 0.872x_3.$$

The estimated probability of having hypertension can be calculated from any combination of the three variables smoking, obesity and snoring. Specifically, we can compare the predicted probabilities for different groups, such as snorers and non-snorers. Setting $x_3$ first to 1 and then to 0 we have

$$\text{logit}(p_1) = -2.378 - 0.068x_1 + 0.695x_2 + 0.872$$

and

$$\text{logit}(p_2) = -2.378 - 0.068x_1 + 0.695x_2,$$

where $x_1$ and $x_2$ are the coded values of smoking and obesity. Thus we have $\text{logit}(p_1) - \text{logit}(p_2) = 0.872$. As noted earlier, this expression is the log odds ratio, so that the odds ratio for hypertension associated with snoring is $e^{0.872} = 2.39$. We can therefore obtain the estimated odds ratio for a variable directly from its regression coefficient. The interpretation of the odds ratio was discussed in section 10.11.2. We can consider it as an approximate measure of the risk of hypertension among snorers relative to the risk among non-snorers.

Clearly for any binary variable the odds ratio can be estimated from the regression coefficient $b$ as $e^b$. We can use the standard error of $b$ to get a confidence interval for $b$ and thus for $e^b$. The standard error of the regression coefficient for snoring was 0.398 (Table 12.20a) and a confidence interval is obtained by taking $b$ to have an approximately Normal sampling distribution. A 95% confidence interval for $b$ is thus given by

$$0.872 - (1.96 \times 0.398) \quad \text{to} \quad 0.872 + (1.96 \times 0.398),$$

that is, from 0.09 to 1.65. The 95% confidence interval for the odds ratio is thus from $e^{0.092}$ to $e^{1.652}$, that is, from 1.10 to 5.22. We are thus 95% sure that the risk of hypertension in snorers compared with non-snorers lies in the range 1.1 to 5.2, which is rather a wide range, but just excludes the value 1.0 that indicates no increased risk.
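The odds ratio and its confidence interval can be reproduced from the coefficient and standard error for snoring in Table 12.20(a). This is a minimal sketch; the function name is hypothetical and the multiplier 1.96 assumes a 95% interval based on the Normal approximation.

```python
import math


def odds_ratio_ci(b, se, multiplier=1.96):
    """Odds ratio exp(b) with a confidence interval built on the log scale."""
    lower = b - multiplier * se
    upper = b + multiplier * se
    return math.exp(b), math.exp(lower), math.exp(upper)


# Snoring: b = 0.872, se(b) = 0.398 (Table 12.20a).
odds_ratio, ci_low, ci_high = odds_ratio_ci(0.872, 0.398)
print(f"{odds_ratio:.2f} ({ci_low:.2f} to {ci_high:.2f})")  # prints 2.39 (1.10 to 5.22)
```

The interval is calculated for $b$ and only then exponentiated, which is why it is not symmetric about the odds ratio.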

12.5.1 Computing

Logistic regression appears very similar to ordinary multiple regression, but the computing method is different. For each individual the dependent variable (hypertension in the example) is either 0 or 1 by definition, for which $\text{logit}(p)$ is minus infinity or infinity respectively. The method of analysis uses an iterative procedure whereby the answer is obtained by several repeated cycles of calculation using an approach known as maximum likelihood. Because of this extra complexity, logistic regression is only found in large statistical packages or those primarily intended for the analysis of epidemiological studies. The same stepwise options that were discussed for ordinary multiple regression can be used for multiple logistic regression.
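To make the idea of repeated cycles of calculation concrete, here is a minimal sketch of one common way of maximizing the likelihood, Newton-Raphson iteration, applied to the grouped data of Table 12.19. The choice of algorithm, the function names and the convergence tolerance are assumptions made for illustration; the text does not say which algorithm any particular package uses.

```python
import math

# Table 12.19: (smoking, obesity, snoring, number of men, number with hypertension)
DATA = [
    (0, 0, 0, 60, 5), (1, 0, 0, 17, 2), (0, 1, 0, 8, 1), (1, 1, 0, 2, 0),
    (0, 0, 1, 187, 35), (1, 0, 1, 85, 13), (0, 1, 1, 51, 15), (1, 1, 1, 23, 8),
]


def solve(a, y):
    """Solve the linear system a x = y by Gaussian elimination with pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_logistic(data, cycles=25, tol=1e-10):
    """Newton-Raphson maximum likelihood for grouped binary data."""
    k = len(data[0]) - 1  # constant plus the explanatory variables
    b = [0.0] * k
    for _ in range(cycles):
        score = [0.0] * k                   # first derivatives of log-likelihood
        info = [[0.0] * k for _ in range(k)]  # information matrix
        for *xs, n, r in data:
            x = [1.0] + [float(v) for v in xs]
            p = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, x))))
            for i in range(k):
                score[i] += x[i] * (r - n * p)
                for j in range(k):
                    info[i][j] += n * p * (1 - p) * x[i] * x[j]
        step = solve(info, score)
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information.
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(solve(info, unit)[i]))
    return b, se


coefficients, standard_errors = fit_logistic(DATA)
print([round(v, 3) for v in coefficients])
print([round(v, 3) for v in standard_errors])
```

The printed values should be close to the coefficients and standard errors in Table 12.20(a), in the order constant, smoking, obesity, snoring. Each pass through the loop is one cycle: the current coefficients give predicted probabilities, and the mismatch between observed and predicted numbers drives the next adjustment.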

12.5.2 Discrimination

A logistic regression model enables us to predict the probability of a particular outcome in relation to several prognostic variables. In other words, it allows us to distinguish those patients likely or unlikely to have the condition, and as such can be a diagnostic aid. The statistical term for this type of analysis is discriminant analysis. An alternative method of discriminant analysis, which can be extended to more than two outcomes, is discussed in section 12.6.

As with multiple regression (see section 12.4.8) we can use the logistic regression model as a prognostic or diagnostic index. If we define $L$ as the logit of the probability $p$ that an individual will have the characteristic of interest, then

$$L = b_0 + b_1x_1 + b_2x_2 + \dots + b_kx_k$$

where there are $k$ variables in the model. We can calculate $L$ for all the subjects in the study and compare the distributions among those with and without the characteristic. From these we can discover how good the separation is between the two groups, and can determine the best cut-off point to maximize the discrimination. If all the explanatory variables are binary, as in the hypertension data, then there are only a few possible values for $L$. For example, the model shown in Table 12.20(b) allows only four groups, defined by presence or absence of obesity and snoring. There are thus only four possible values for $L$, each leading to an estimated probability of hypertension. These are shown in Table 12.21 together with the observed proportions with hypertension in the four groups.

Table 12.21 Predicted probability of hypertension ($p$) and observed proportions

| Obesity | Snoring | L | p | Observed proportion |
|---|---|---|---|---|
| No | No | -2.392 | 0.08 | 0.09 (7/77) |
| Yes | No | -1.697 | 0.15 | 0.09 (1/11) |
| No | Yes | -1.526 | 0.18 | 0.18 (48/272) |
| Yes | Yes | -0.831 | 0.30 | 0.31 (23/74) |

The agreement is excellent. It is clear, however, that we could not predict hypertension with any accuracy using information about obesity and snoring, even though we can say that hypertension is much more common if both are present than if neither is. To be useful diagnostically, we would need much greater variation in the risk of hypertension among groups.
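The scores and predicted probabilities in Table 12.21 follow directly from the coefficients in Table 12.20(b). A minimal sketch, with hypothetical function and constant names:

```python
import math

# Coefficients from Table 12.20(b): constant, obesity, snoring.
B0, B_OBESITY, B_SNORING = -2.392, 0.695, 0.866


def score(obesity, snoring):
    """Logit of the predicted probability; obesity and snoring are coded 0 or 1."""
    return B0 + B_OBESITY * obesity + B_SNORING * snoring


def probability(logit_value):
    """Convert a logit back to a probability."""
    return 1 / (1 + math.exp(-logit_value))


# Same row order as Table 12.21: (No, No), (Yes, No), (No, Yes), (Yes, Yes).
for obesity, snoring in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    value = score(obesity, snoring)
    print(obesity, snoring, round(value, 3), round(probability(value), 2))
```

The four predicted probabilities span only 0.08 to 0.30, which is the numerical reason the model cannot separate hypertensive from non-hypertensive men.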

If one or more of the variables in the model is continuous, the values of the score, $L$, will have a continuous distribution. The question that then arises is: how different are the distributions in the groups defined by the outcome variable? If there is little overlap, we can choose a cut-off that will give us good discrimination, but if there is considerable overlap the model will not be clinically useful. We are thus using the model to create a diagnostic test; this problem is discussed further in section 14.4.

Peeters et al. (1987) examined the predictive values of a positive test result in screening for breast cancer by mammography. Over a ten year period 801 women had positive mammography results and were referred for clinical examination. Breast cancer was histologically confirmed within one year in 302 women, 10 women were excluded for various reasons, and 489 women were classified as having had a false positive mammography result. The researchers compared the 302 true positives with the 489 false positives to see if they could improve the diagnosis by incorporating other information including epidemiological characteristics. Fifteen variables were examined, of which five (age at referral, body mass index, menopausal status, breast complaints, and Wolfe classification of the contralateral breast) were significantly related to risk of cancer. Multiple logistic regression analysis yielded a model containing just two significant variables, age at referral (in years) and breast complaints (No or Yes; this refers to previous history of pain, skin problems, and so on). Their regression model to predict $p$, the probability of being a true positive, was a logistic equation of the form $\text{logit}(p) = b_0 + b_1x_1 + b_2x_2$, where $x_1$ is age and $x_2$ is breast complaints. For each woman the researchers evaluated $p$, the probability of being diagnosed as having breast cancer predicted by their model. They divided these probabilities into ten equal intervals and examined the frequencies of positive and negative diagnoses in the ten groups, to get the results shown in Table 12.22. As they observed, the considerable overlap of the distributions means that the model cannot help to distinguish false positives from true positives. A model that is highly significant does not guarantee good discrimination. Indeed, this type of finding is common, and discrimination good enough to aid diagnosis is rare.

Table 12.22 Distribution of 787 mammography test results in relation to predicted probability of being a true positive (Peeters et al., 1987). (Four cases with missing data excluded)

| Test result | 0.0-0.1 | 0.1-0.2 | 0.2-0.3 | 0.3-0.4 | 0.4-0.5 | 0.5-0.6 | 0.6-0.7 | 0.7-0.8 | 0.8-0.9 | 0.9-1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Negative (N = 487) (False positive) | 0 | 68 | 167 | 99 | 75 | 51 | 22 | 3 | 2 | 0 |
| Positive (N = 300) (True positive) | 0 | 10 | 55 | 56 | 80 | 56 | 28 | 9 | 5 | 1 |
| Observed proportion of true positives | - | 0.13 | 0.25 | 0.36 | 0.52 | 0.52 | 0.56 | 0.75 | 0.75* | |

Column headings give the predicted probability of a true positive test result.

*Intervals 0.8-0.9 and 0.9-1.0 combined (6/8)
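The overlap can be quantified from the counts in Table 12.22. The sketch below recomputes the observed proportion of true positives in each interval and then asks what would happen if a predicted probability of 0.5 or more were used as a cut-off. The cut-off of 0.5 and the function names are assumptions chosen for illustration, not part of the analysis by Peeters et al.

```python
# Counts per predicted-probability interval (0.0-0.1, 0.1-0.2, ..., 0.9-1.0).
FALSE_POSITIVES = [0, 68, 167, 99, 75, 51, 22, 3, 2, 0]
TRUE_POSITIVES = [0, 10, 55, 56, 80, 56, 28, 9, 5, 1]


def observed_proportions(true_pos, false_pos):
    """Proportion of true positives in each interval (None if interval is empty)."""
    return [t / (t + f) if (t + f) > 0 else None
            for t, f in zip(true_pos, false_pos)]


def cut_off_performance(true_pos, false_pos, first_bin):
    """Fractions correctly classified when bins from first_bin upward are flagged."""
    flagged_true = sum(true_pos[first_bin:]) / sum(true_pos)
    unflagged_false = sum(false_pos[:first_bin]) / sum(false_pos)
    return flagged_true, unflagged_false


proportions = observed_proportions(TRUE_POSITIVES, FALSE_POSITIVES)
flagged_true, unflagged_false = cut_off_performance(TRUE_POSITIVES, FALSE_POSITIVES, 5)
print([None if p is None else round(p, 2) for p in proportions])
print(round(flagged_true, 2), round(unflagged_false, 2))
```

With this cut-off only 99 of the 300 true positives (33%) would be flagged, while 409 of the 487 false positives (84%) would be correctly left unflagged. The last two intervals taken separately give 5/7 and 1/1; pooled they give 6/8 = 0.75.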

A counter-example is given by a study of anti-smoking advice given by general practitioners in Australia (Richmond et al., 1988). They developed a model using six variables to predict which smokers would abstain for six months, with correct prediction for 73/100 patients. This finding suggests that those patients predicted as unlikely to abstain could receive more intensive counselling. It also indicates that the adequacy of a model depends on the clinical situation: 73% accuracy was good in this study, but would be awful in many circumstances (see discussion of diagnostic tests in section 14.4). It is worth noting that we would be right half the time by guessing at random.

It is not always desirable to impose a cut-off between high and low risk groups; it may be better to calculate a risk score. This was the approach used to produce the 'ready reckoner' for identifying men at high risk of heart attack, described in sections 1.1 and 1.4.1. The risk score was calculated by adding together weighted contributions from

years of smoking cigarettes

mean blood pressure (mmHg)

a fixed amount if the man recalls a diagnosis of ischaemic heart disease

a fixed amount if there was evidence of angina (from a questionnaire)

a fixed amount if either parent had died of heart trouble

a fixed amount if he was diabetic

(Shaper et al., 1986). Here the numbers used to derive the score were derived from the logistic regression coefficients, with slight modification to make a score of 1000 correspond to the cut-off for the 20% of men with the highest risk. The score was calculated for each of the 7506 men included in the analysis. Table 12.23 shows the scores corresponding to selected centiles of the distribution, together with the estimated risk of ischaemic heart disease.

Table 12.23 Risk scores and estimated risk at selected centiles of the distribution of risk among 7506 men aged 40-59 (Shaper et al., 1986)

| Centile of distribution of risk scores | Risk score | Estimated rate of risk per 1000 men per year |
|---|---|---|
| 10 | 647 | 1.8 |
| 20 | 713 | 2.4 |
| 30 | 766 | 3.1 |
| 40 | 812 | 3.9 |
| 50 | 856 | 4.8 |
| 60 | 898 | 5.8 |
| 70 | 944 | 7.1 |
| 80 | 1000 | 9.2 |
| 90 | 1091 | 13.5 |

12.5.3 Comments

With the exception of the method used to derive the regression model and the method for testing the significance of individual variables, fitting a logistic regression model is subject to the same difficulties as discussed in section 12.4.10 for ordinary multiple regression. The other main difference is that we cannot use scatter plots to plot the residuals because all of the observed data values are 0 or 1. The simplest solution is to divide the data into groups, as in Tables 12.21 and 12.22, and compare the observed and predicted proportions. Formal methods exist for assessing goodness-of-fit, but they are beyond the scope of this book.

12.6 DISCRIMINANT ANALYSIS

As noted at the beginning of section 12.5, there is another (older) method for using several variables to help distinguish groups, known as discriminant analysis. The usual situation is that we wish to be able to find some combination of variables that classifies a large proportion of subjects into the correct group, so that we can have a good chance of allocating (diagnosing) new subjects correctly. Simultaneously we usually wish to choose for the discrimination a subset of useful variables from a larger set of candidates. Discriminant analysis is more complicated than multiple regression, and I do not recommend that it is used without prior experience or expert assistance. In most cases discriminant analysis is used as an exploratory technique, so it is valuable to have an independent data set on which to assess how good the model is.

The basic idea of discriminant analysis is as follows. We first find the combination of variables that maximizes the separation between the groups, as with logistic regression. With more than two groups we can further separate the groups by constructing a second combination of the same variables. These combinations are called canonical variates or discriminant functions. The method is perhaps best understood by considering a graph showing the results of an analysis. The method is based on the strong assumption that all of the variables have a Normal distribution with the same standard deviation within each group. It is generally agreed that some departure from this principle is acceptable, for example to include a few binary variables, but as usual it is difficult to say how much flexibility can be granted before the method becomes unreliable.

Thompson et al. (1985) carried out a study to try to differentiate diagnoses of ulcerative colitis, Crohn's disease and other forms of inflammatory bowel disease using rectal biopsy measurements. Seventy-five biopsies were studied, comprising 20 patients with normal biopsies, 20 with ulcerative colitis, 20 with Crohn's disease and 15 with culture positive diarrhoea. Stepwise discriminant analysis on 12 variables yielded a model comprising five variables, all highly statistically significant.

Figure 12.4 shows the first two discriminant functions for the 75 observations, with superimposed circles indicating the areas in which we would expect (on the basis of the model) 80% of observations for each group. It is clear that the circle for the Crohn's disease group overlaps those for the other groups, so that we cannot use the model to get a reliable diagnosis. Of the 75 observations, the model correctly predicted 19/20 (95%) of the normal group, 9/20 (45%) with Crohn's disease, 14/20 (70%) with ulcerative colitis, and 12/15 (80%) with infective diarrhoea. We would expect the model to do worse when a completely new set of cases is examined, and Thompson et al. found that only 14 out of a new series of 24 cases were correctly 'diagnosed' by the model, a 58% success rate compared with 72% on the original set.

Sample size is again an issue, and it has been suggested that there should be at least five times as many subjects per group as variables examined (Lachenbruch, 1977).
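The overall success rates follow from the counts given for the four groups, and the sample size suggestion can be checked in the same way. As a short worked calculation, assuming the rule is applied to the 12 variables examined:

$$\frac{19 + 9 + 14 + 12}{75} = \frac{54}{75} = 0.72, \qquad \frac{14}{24} \approx 0.58, \qquad 5 \times 12 = 60 \text{ subjects per group.}$$

On that reading the suggestion would call for about 60 subjects per group, against the 15 to 20 per group available in this study.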


Figure 12.4 Discriminant functions from data of Thompson et al. (1985).

Discriminant analysis is a complex technique, and more detailed discussion is inappropriate in this book. More details can be found in some textbooks, or in the useful papers by Lachenbruch (1977) and Brown (1984). When there are only two groups discriminant analysis usually gives similar answers to logistic regression analysis (see section 12.5.2).

12.7 OTHER METHODS

It is important to be aware that there are many other complex statistical methods that are not covered by introductory books. Other multivariate methods exist, such as cluster analysis and factor analysis. There is a vast time series methodology for dealing with long series of observations. There are important methods adapted from industrial quality control for assessing whether there has been a (sudden) change in the level of a variable, with applications in monitoring kidney transplants or detecting the time of ovulation from daily body temperature measurements. There are special methods for dealing with multi-way frequency tables, that is, cross-tabulations of three or more categorical variables. And there are many other specialized techniques.

While complicated problems do not necessarily require a complicated statistical analysis, it is unwise to try to force a complex problem to fit into the framework of a more familiar simpler technique. Expert statistical advice should be sought if at all possible.

EXERCISES

12.1 The table below shows data from an experiment to compare resting metabolic rates in five volunteers each given two diets, a normal diet and an overfeeding diet which contained 50% more energy (Welle et al., 1986). Data were collected before and after eating a meal.

| Subject | Diet | Before meal | After meal |
|---|---|---|---|
| 1 | N | 1.47 | 1.78 |
| 1 | O | 1.72 | 2.49 |
| 2 | N | 1.42 | 1.68 |
| 2 | O | 1.44 | 1.87 |
| 3 | N | 1.10 | 1.26 |
| 3 | O | 1.11 | 1.36 |
| 4 | N | 0.84 | 1.11 |
| 4 | O | 0.90 | 1.29 |
| 5 | N | 0.91 | 1.09 |
| 5 | O | 1.00 | 1.25 |

N: Normal diet; O: Overfed diet

(a) What methods of analysis could be used to examine the difference between the metabolic rates in relation to diet:

(i) for the post-prandial data;

(ii) for the change between pre- and post-prandial resting metabolic rates?

(b) Carry out an analysis to see if there is a significant difference between the two diets in the change between pre- and post-prandial resting metabolic rates.

12.2 Using the data in Table 12.11, find a suitable multiple regression model to predict functional residual capacity (FRC) from explanatory variables including age, sex, height and weight. Check that the residuals from this model have a nearly Normal distribution.

12.3 Data from 37 patients receiving a non-depleted allogeneic bone marrow transplant were examined to see which variables were associated with the development of acute graft-versus-host disease (GvHD) (Bagot et al., 1988). The table below shows, separately for the groups who did not and did develop GvHD, the age of the recipient and donor, the type of leukaemia, whether or not the donor had been pregnant and an index of mixed epidermal cell-lymphocyte reactions. Donor pregnancy (Preg) is coded 0 for No and 1 for Yes. Type of leukaemia is coded 1 (acute myeloid leukaemia - AML), 2 (acute lymphocytic leukaemia - ALL) or 3 (chronic myeloid leukaemia - CML). Each group is ordered by their index values. (Also shown is the survival time, which is not used here.)

Patients without GvHD

| Patient | Recipient age | Donor age | Type | Preg | Index | Survival time (days) |
|---|---|---|---|---|---|---|
| 1 | 27 | 23 | 2 | 0 | 0.27 | 95 |
| 2 | 13 | 18 | 2 | 0 | 0.31 | 1385* |
| 3 | 19 | 19 | 1 | 0 | 0.39 | 465 |
| 4 | 21 | 22 | 2 | 0 | 0.48 | 810 |
| 5 | 28 | 38 | 2 | 0 | 0.49 | 1497* |
| 6 | 22 | 20 | 2 | 0 | 0.50 | 1181 |
| 7 | 19 | 19 | 2 | 0 | 0.81 | 993* |
| 8 | 20 | 23 | 2 | 0 | 0.82 | 138 |
| 9 | 33 | 36 | 1 | 0 | 0.86 | 266 |
| 10 | 18 | 19 | 1 | 0 | 0.92 | 579* |
| 11 | 17 | 20 | 2 | 0 | 1.10 | 600* |
| 12 | 31 | 21 | 3 | 0 | 1.52 | 1182* |
| 13 | 23 | 38 | 2 | 0 | 1.88 | 841* |
| 14 | 17 | 15 | 2 | 0 | 2.01 | 1364* |
| 15 | 26 | 16 | 2 | 0 | 2.40 | 695* |
| 16 | 28 | 25 | 1 | 0 | 2.45 | 1378* |
| 17 | 24 | 21 | 1 | 1 | 2.60 | 736* |
| 18 | 18 | 20 | 2 | 0 | 2.64 | 1504* |
| 19 | 24 | 25 | 1 | 1 | 3.78 | 849 |
| 20 | 20 | 24 | 3 | 0 | 4.72 | 1266* |

Patients with GvHD

| Patient | Recipient age | Donor age | Type | Preg | Index | Survival time (days) |
|---|---|---|---|---|---|---|
| 21 | 23 | 35 | 1 | 1 | 1.10 | 186 |
| 22 | 21 | 35 | 2 | 1 | 1.16 | 41 |
| 23 | 21 | 23 | 3 | 0 | 1.45 | 667* |
| 24 | 33 | 43 | 3 | 0 | 1.50 | 112 |
| 25 | 29 | 24 | 3 | 1 | 1.85 | 572* |
| 26 | 42 | 35 | 2 | 1 | 2.30 | 45 |
| 27 | 27 | 31 | 3 | 0 | 2.34 | 1019* |
| 28 | 43 | 29 | 2 | 1 | 2.44 | 479 |
| 29 | 22 | 20 | 1 | 0 | 3.70 | 190 |
| 30 | 35 | 39 | 1 | 1 | 3.73 | 100 |
| 31 | 16 | 14 | 1 | 0 | 4.13 | 177 |
| 32 | 39 | 35 | 2 | 1 | 4.52 | 80 |
| 33 | 28 | 25 | 3 | 1 | 4.52 | 142 |
| 34 | 29 | 32 | 3 | 0 | 4.71 | 1105* |
| 35 | 23 | 19 | 3 | 0 | 5.07 | 803* |
| 36 | 33 | 34 | 3 | 0 | 9.00 | 1126* |
| 37 | 19 | 20 | 1 | 0 | 10.11 | 114 |

(a) Use appropriate tests to compare the first five variables in the two groups. Which variables are significantly associated with the development of graft-versus-host disease?

(b) Use multiple logistic regression to see which variables are significantly related to GvHD (with . (Hint: Create two new 'dummy' variables indicating disease groups 2 and 3, and use log transformed index values.)

(c) Calculate the odds ratio for the risk of GvHD in relation to each binary variable in the model, with a 90% confidence interval.

12.4 Multiple logistic regression was used to construct a prognostic index to predict significant coronary artery disease from data on 348 patients with valvular heart disease who had undergone routine coronary arteriography before valve replacement (Ramsdale et al., 1982). Forward stepwise selection was used, using as the criterion for including variables. The prognostic index obtained was based on a model containing seven variables.

(a) The regression coefficient for a family history of ischaemic heart disease (coded ) was 1.167. What is the estimated odds ratio for having significant coronary artery disease associated with a positive family history?

(b) One of the variables in the model was the estimated total number of cigarettes ever smoked, calculated as the average number smoked annually × the number of years smoking. The regression coefficient was 0.0106 per 1000 cigarettes. What total number of cigarettes ever smoked carries the same risk as a family history of ischaemic heart disease? Convert this figure into years of smoking 20 cigarettes per day.

(c) What is the odds ratio for major coronary artery disease for someone with a family history of ischaemic heart disease who had smoked 20 cigarettes a day for 30 years compared with a non-smoker with no family history?

12.5 For lung transplantation it is desirable for the donor's lungs to be of a similar size to those of the recipient. Total lung capacity (TLC) is difficult to measure, so it is useful to be able to predict TLC from other information. The following table shows the pre-transplant TLC of 32 recipients of heart-lung transplants, obtained by whole-body plethysmography, and their age, sex and height (Otulana et al., 1989).

| Subject | Age | Sex | Height (cm) | TLC (l) |
|---|---|---|---|---|
| 1 | 35 | F | 149 | 3.40 |
| 2 | 11 | F | 138 | 3.41 |
| 3 | 12 | M | 148 | 3.80 |
| 4 | 16 | F | 156 | 3.90 |
| 5 | 32 | F | 152 | 4.00 |
| 6 | 16 | F | 157 | 4.10 |
| 7 | 14 | F | 165 | 4.46 |
| 8 | 16 | M | 152 | 4.55 |
| 9 | 35 | F | 177 | 4.83 |
| 10 | 33 | F | 158 | 5.10 |
| 11 | 40 | F | 166 | 5.44 |
| 12 | 28 | F | 165 | 5.50 |
| 13 | 23 | F | 160 | 5.73 |
| 14 | 52 | M | 178 | 5.77 |
| 15 | 46 | F | 169 | 5.80 |
| 16 | 29 | M | 173 | 6.00 |
| 17 | 30 | F | 172 | 6.30 |
| 18 | 21 | F | 163 | 6.55 |
| 19 | 21 | F | 164 | 6.60 |
| 20 | 20 | M | 189 | 6.62 |
| 21 | 34 | M | 182 | 6.89 |
| 22 | 43 | M | 184 | 6.90 |
| 23 | 35 | M | 174 | 7.00 |
| 24 | 39 | M | 177 | 7.20 |
| 25 | 43 | M | 183 | 7.30 |
| 26 | 37 | M | 175 | 7.65 |
| 27 | 32 | M | 173 | 7.80 |
| 28 | 24 | M | 173 | 7.90 |
| 29 | 20 | F | 162 | 8.05 |
| 30 | 25 | M | 180 | 8.10 |
| 31 | 22 | M | 173 | 8.70 |
| 32 | 25 | M | 171 | 9.45 |

(a) How well can an individual's lung capacity be predicted from a multiple regression model including age, sex and height?

(b) Compare the result just obtained with that derived from linear regression on height alone.

(c) Calculate the 95% prediction interval from the linear regression on height for someone with average height.

(d) How could we investigate whether the relation between lung capacity and height is the same for males and females?

13 Analysis of survival times

13.1 INTRODUCTION

In most studies the data are a mixture of measurements and attributes. The preceding four chapters have presented methods for the analysis of both quantitative and qualitative data for various study designs. Another type of data arises when interest is focused on the time taken for some event to occur. One of the most common sources of such data is when we record the time from some fixed starting point, such as surgery, to the death of the subject. For this reason we usually refer to survival times or survival data and the statistical treatment of survival times is known as survival analysis. As we shall see, similar data arise in other situations, but it is customary to stick to the same terminology.

In clinical studies survival times often refer to the time to death, to development of a particular symptom, or to relapse after remission of disease. Although there is usually a clear definition of the end of the time period of interest, the start may be less well defined. It is, for example, rarely possible to know how long somebody has had a disease, so the date of diagnosis is often the best alternative. For some diseases these two dates can be very different.

There is one inherent feature of survival times that makes them unsuitable for analysis by any of the methods described in the preceding chapters, which is that we almost never observe the event of interest in all subjects. For example, in a study to compare the survival of patients having different types of surgery for breast cancer, although the patients will be followed up for several years there will be many who are still alive at the end of the study. For these patients we do not know when they will die, only that they are still alive at the end of the study. Nor, therefore, do we know their survival time from surgery, only that it will be longer than their time in the study. We call such survival times censored, to indicate that the period of observation was cut off before the event of interest occurred. Note that as the event of interest is usually something that is undesirable, such as death, the 'interest' is scientific, not clinical.

If all subjects were followed for exactly the same length of time it would perhaps be possible to use the rank methods introduced in Chapter 9 for analysing survival times, giving all censored times the equal highest rank.

However, patients are nearly always followed for varying lengths of time. In any case patients may leave the study before the end, perhaps moving to a different area. Withdrawals thus lead to censored observations of a different type.

Figure 13.1 illustrates the different ways in which patients can proceed through a study. It shows a six month period during which patients are recruited to the study, and a further 12 months of observation. The patients are thus observed for between 12 and 18 months, the most recently accrued patients being observed for the shortest time. Figure 13.1 shows that four patients died and four were still alive at the end of the study. Two other patients withdrew from the study before the end. We thus have four firm survival times and six censored times, as shown in Table 13.1, where the asterisk denotes a censored survival time. We ignore the different starting times when analysing survival data, and it helps to order the observations by survival time. Figure 13.2 shows the effect of these changes.
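The step from calendar times to survival times can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, using the entry and exit times of Table 13.1; the names `records` and `survival_times` are my own and are not part of any statistical package.

```python
# Each record: (patient, month of entry, month of death or censoring, status),
# where status "D" means died and "C" means censored (Table 13.1).
records = [
    (1, 0.0, 11.8, "D"), (2, 0.0, 12.5, "C"), (3, 0.4, 18.0, "C"),
    (4, 1.2, 4.4, "C"), (5, 1.2, 6.6, "D"), (6, 3.0, 18.0, "C"),
    (7, 3.4, 4.9, "D"), (8, 4.7, 18.0, "C"), (9, 5.0, 18.0, "C"),
    (10, 5.8, 10.1, "D"),
]

def survival_times(rows):
    """Return (survival time, censored?, patient) tuples in ascending time order."""
    # The starting time is ignored: only the length of follow-up matters.
    # Rounding to one decimal place removes floating-point noise.
    out = [(round(end - start, 1), status == "C", patient)
           for patient, start, end, status in rows]
    return sorted(out)

ordered = survival_times(records)
for time, censored, patient in ordered:
    print(f"patient {patient:2d}: {time:4.1f}{'*' if censored else ''}")
```

The ordered list is the form in which the observations are displayed in Figure 13.2 and assumed by the methods that follow.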

With data of this type we often wish to estimate the probability of an individual surviving for a given time period such as one year. With two or more groups we will also be interested in comparing their survival experience. This chapter introduces methods to answer these and other questions relating to survival data. For convenience, I shall assume that the data have already been sorted into ascending order of survival times. (Computer programs may require this.)

The analysis of medical survival data has become widespread since the early 1970s when new methods were developed. Most of the methods described in this chapter are discussed in much more detail in two excellent papers by Peto et al. (1976 and 1977), especially in the second paper. These papers also contain a wealth of practical advice about the design and execution of studies of survival times.

Figure 13.1 Diagram showing patients entering a study at different times and the observation of known and censored survival times.

Table 13.1 Survival times for patients shown in Figure 13.1

| Patient | Time at entry (months) | Time at death or censoring (months) | Dead (D) or censored (C) | Survival time (months) |
|---|---|---|---|---|
| 1 | 0.0 | 11.8 | D | 11.8 |
| 2 | 0.0 | 12.5 | C | 12.5* |
| 3 | 0.4 | 18.0 | C | 17.6* |
| 4 | 1.2 | 4.4 | C | 3.2* |
| 5 | 1.2 | 6.6 | D | 5.4 |
| 6 | 3.0 | 18.0 | C | 15.0* |
| 7 | 3.4 | 4.9 | D | 1.5 |
| 8 | 4.7 | 18.0 | C | 13.3* |
| 9 | 5.0 | 18.0 | C | 13.0* |
| 10 | 5.8 | 10.1 | D | 4.3 |

*censored observation

Figure 13.2 Figure 13.1 reorganized to correspond to method of analysis.

13.2 SURVIVAL PROBABILITIES

From a set of observed survival times (including censored times) from a sample of individuals we can estimate the proportion of the population of such people who would survive a given length of time in the same circumstances. For example, we can use data from a study of patients having liver transplants to estimate the probability of new patients surviving a given length of time after transplantation (with the usual proviso about the representativeness of the original sample). The method is clever in that it not only makes proper allowance for those observations that are censored, but also makes use of the information from these subjects up to the time when they are censored. The method yields a graph or a table, which goes under various names: life table, survival curve, Kaplan-Meier curve.

13.2.1 Kaplan-Meier survival curve

The probability of surviving a given length of time can be calculated by considering time in many small intervals. For example, the probability of a patient surviving two days after a liver transplant can be considered to be the probability of surviving one day, multiplied by the probability of surviving the second day given that the patient survived the first day. This second probability is known as a conditional probability. If we write $p_{100}$ as the probability of surviving the hundredth day conditional on having already survived the first 99 days, and $p_k$ in the same way for any day $k$, then the overall probability of surviving 100 days after a liver transplant is given by

$$p_1 \times p_2 \times p_3 \times \cdots \times p_{99} \times p_{100}.$$

The probability of surviving the 100th day is estimated simply as the proportion of the sample surviving that day of those still known to be alive after 99 days. The probability $p_k$ is thus 1 on days when nobody dies, so the calculations are simplified by the fact that it is only necessary to calculate the probabilities for days on which at least one person dies.

The survival curve calculations will be illustrated on a small data set arising from a research programme aimed at the prediction of motion sickness at sea (Burns, 1984). Subjects were placed in a cubical cabin mounted on a hydraulic piston and subjected to vertical motion (known as 'heave'!) for two hours. The endpoint of interest was the time when the subject first vomited (known as 'frank emesis'). Some subjects requested an early stop to the experiment although they had not vomited, yielding censored observations, while others successfully survived two hours. Twenty-one subjects were studied at a fixed frequency and an acceleration of 0.111 G, 14 of whom survived two hours without vomiting. The survival times (in minutes) of the other seven subjects were

30, 50, 50*, 51, 66*, 82, 92

where the two observations marked * were censored. The other 14 observations were censored at 120 minutes.

Table 13.2 Life table for motion sickness data from an experiment with vertical movement at a fixed frequency and an acceleration of 0.111 G (Burns, 1984) (Experiment 1)

| Subject number | Survival time (min) | Survival proportion | Standard error |
|---|---|---|---|
| 1 | 30 | 0.952 | 0.045 |
| 2 | 50 | 0.905 | 0.062 |
| 3 | 50* | | |
| 4 | 51 | 0.855 | 0.077 |
| 5 | 66* | | |
| 6 | 82 | 0.801 | 0.089 |
| 7 | 92 | 0.748 | 0.097 |
| 8 | 120* | | |
| 9 | 120* | | |
| .. | .. | | |
| 21 | 120* | | |

*censored observation

Table 13.2 shows the life table for these data, giving the survival proportion at each uncensored survival time. Because only five subjects vomited there are only five estimated survival probabilities. Note that the survival probability remains 1 up to the time of the first event (30 minutes), and we cannot estimate survival beyond the last observation of 120 minutes. It is usual to present survival probabilities as a graph, as shown in Figure 13.3.
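The product of conditional probabilities can be written as a short Python function. The sketch below is a minimal illustration rather than a replacement for a proper statistics package: the function name `kaplan_meier` and the 0/1 coding of events are my own choices, and subjects censored at the same time as an event are assumed still to be at risk at that time.

```python
from itertools import groupby

def kaplan_meier(times, events):
    """Return (time, number at risk, events, survival proportion) per event time.

    times  : survival times, censored or not
    events : 1 if the event was observed at that time, 0 if the time is censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surviving = 1.0
    table = []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        flags = [e for _, e in group]
        n_events = sum(flags)
        if n_events:
            # Conditional probability of surviving time t, given survival up to t.
            surviving *= (at_risk - n_events) / at_risk
            table.append((t, at_risk, n_events, surviving))
        # Everyone observed at time t (event or censored) now leaves the risk set.
        at_risk -= len(flags)
    return table

# Experiment 1: seven early times plus 14 subjects censored at 120 minutes.
times_1 = [30, 50, 50, 51, 66, 82, 92] + [120] * 14
events_1 = [1, 1, 0, 1, 0, 1, 1] + [0] * 14

life_table = kaplan_meier(times_1, events_1)
for t, n, d, s in life_table:
    print(f"{t:4d} min  at risk {n:2d}  events {d}  proportion {s:.3f}")
```

The five proportions produced agree with those in Table 13.2 to within rounding in the third decimal place. Note that the censored times at 50 and 66 minutes never produce a step of their own; they only reduce the number at risk for later events.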

From the survival curve we can calculate the survival time corresponding to any proportion of the sample. For example, the time when the curve crosses the probability of 0.5 corresponds to the estimated median survival time. In this example, however, we cannot estimate the median as the curve does not fall to 0.5.

The survival curve is drawn as a 'step function': the proportion surviving remains unchanged between events, even if there are some intermediate censored observations. It is incorrect to join the calculated points by sloping lines. The times of censored observations are sometimes indicated by ticks on the survival curve, which shows at a glance the survival times of the surviving subjects.
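Reading values off a step function is easy to get wrong, so a small sketch may help. It takes the published proportions from Tables 13.2 and 13.3 as given; the helper names `survival_at` and `median_survival` are hypothetical, and neither function should be used for times beyond the last observation (120 minutes here), where survival cannot be estimated.

```python
import bisect

# (event time in minutes, survival proportion) taken from Table 13.2
curve_1 = [(30, 0.952), (50, 0.905), (51, 0.855), (82, 0.801), (92, 0.748)]
# and from Table 13.3
curve_2 = [(5, 0.964), (11, 0.890), (13, 0.853), (24, 0.816), (63, 0.779),
           (65, 0.742), (69, 0.668), (79, 0.631), (82, 0.556), (102, 0.519),
           (115, 0.482)]

def survival_at(curve, t):
    """Step-function value at time t: the proportion at the last event time <= t."""
    event_times = [time for time, _ in curve]
    k = bisect.bisect_right(event_times, t)
    # Before the first event the proportion surviving is 1.
    return 1.0 if k == 0 else curve[k - 1][1]

def median_survival(curve):
    """First event time at which the curve has fallen to 0.5 or below, else None."""
    for time, proportion in curve:
        if proportion <= 0.5:
            return time
    return None

print(survival_at(curve_1, 90))    # value held since the event at 82 minutes
print(median_survival(curve_1))    # the curve never reaches 0.5
print(median_survival(curve_2))
```

No interpolation is performed between events, which is the numerical counterpart of not joining the plotted points by sloping lines.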

We can calculate a confidence interval for the survival proportion. If there are no censored values we can use standard methods for deriving a confidence interval for a proportion (see section 10.2), but in general we will need to make a modification to allow for the censoring. Section 13.4.1 gives a method for calculating the standard error; Table 13.2 shows standard errors for the motion sickness data. Some computer programs will provide standard errors, although these may have been produced by a more complex method than is given in section 13.4.1.

Figure 13.3 Survival curve corresponding to the motion sickness data in Table 13.2.

From the standard error we can calculate a confidence interval, assuming a Normal sampling distribution for the survival proportion in large samples. For example, the proportion surviving 90 minutes without vomiting was 0.801 with a standard error of 0.089. The 95% confidence interval is thus

$$0.801 \pm 1.96 \times 0.089,$$

or 0.63 to 0.98. As usual, with a small sample the confidence interval is wide. Note that when the proportion surviving is near 1 or 0 the calculated confidence interval may include impossible values above 1 or less than 0. If this happens we can take 1 as the upper limit or 0 as the lower limit. However, this occurrence indicates that the Normal approximation is not really appropriate and some other method may be preferable. Better methods exist for calculating standard errors, but they are also more complicated.
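The interval and the truncation rule just described amount to very little arithmetic. The following sketch assumes the Normal approximation and takes the proportion and standard error from Table 13.2; the function name `normal_ci` is mine.

```python
def normal_ci(proportion, standard_error, z=1.96):
    """Normal-approximation confidence interval, truncated to the range 0 to 1."""
    lower = proportion - z * standard_error
    upper = proportion + z * standard_error
    # Impossible limits are replaced by 0 or 1, as suggested in the text.
    return max(0.0, lower), min(1.0, upper)

low, high = normal_ci(0.801, 0.089)
print(f"{low:.2f} to {high:.2f}")

# At 30 minutes the proportion is close to 1 and the upper limit is truncated,
# a warning that the Normal approximation is doubtful here.
print(normal_ci(0.952, 0.045))
```

The default `z=1.96` gives a 95% interval; another multiplier would be needed for other confidence levels.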

The data used in this example are from an experiment of fixed duration, so that most of the censored observations are at the same time. In observational studies, such as the study of liver transplant patients, it is customary to stop the period of observation on a specific day. Because subjects enter on different days (as shown in Figure 13.1) survivors have widely varying periods of follow up and thus survival times censored at different points. All of the methods described in this chapter apply equally in both circumstances.

13.2.2 Life table analysis

Although the Kaplan-Meier survival curve is often called a life table, the term life table is also frequently used to describe data where the results are grouped into time intervals, often of equal length. This method is often described as actuarial. The method of calculation is similar in principle to the Kaplan-Meier method, but differences arise because of the lack of precision of recording of times. Details are given by Armitage and Berry (1987, p. 424).

Life tables are also used in demography to estimate the survival curve for a cohort of people from birth using current age and sex specific mortality rates. These cohort life tables are calculated somewhat differently (Armitage and Berry, 1987, p. 422; Bland, 1987, p. 302).

13.3 COMPARING SURVIVAL CURVES IN TWO GROUPS

For studies in which the aim is to compare the survival experience of two groups of subjects we can calculate the Kaplan-Meier curves separately for each group. The standard error of the difference in the proportions surviving at any time can be calculated, and a confidence interval obtained. The weakness of this approach is that it does not provide a comparison of the total survival experience of the two groups, but rather gives a comparison at some arbitrary time point(s). The choice of the time point to make a comparison should really be made in advance of the analysis, not after inspection of the survival curves: a comparison of proportions at a time point chosen after seeing the curves is invalid. The use of multiple time points creates further problems of interpretation, especially if the curves are significantly different at some points but not at others. Comparing survival probabilities can be useful as an adjunct to other analyses, however, and is described later. First I shall consider methods for comparing the complete survival curves for two or more independent sets of observations.

The most common method of comparing independent groups of survival times is the logrank test. As its name indicates, the logrank test is a hypothesis test - the null hypothesis is that the groups come from the same population. There is no similarly widely used method of estimation, but some possibilities are considered later in this chapter.

13.3.1 The logrank test

The logrank test is a non-parametric method for testing the null hypothesis that the groups being compared are samples from the same population as regards survival experience. The method is based on a simple idea which avoids the arbitrary decisions referred to above.

Table 13.3 shows the data (and the life table) from a second motion sickness experiment using different subjects in which both the frequency and acceleration were doubled in comparison with the first experiment. The logrank test can be used to compare the data from the two experiments.

The principle of the logrank test is to divide the survival time scale into intervals according to the distinct observed survival times, ignoring censored survival times. There were five definite events (vomiting) in the first experiment at 30, 50, 51, 82 and 92 minutes. In the second experiment there were 14 events, one each at 5, 13, 24, 63, 65, 79, 102 and 115 minutes, and two each at 11, 69 and 82 minutes. For the two experiments combined there were 15 distinct recorded survival times. Figure 13.4 shows the time scale divided into 15 time intervals, each of which includes the time of an event at the upper limit. The first interval is from 0 to 5 minutes, the second is from 6 to 11 minutes, and so on. For each time period we compare the observed data with what we would expect if the null hypothesis that there is no real difference between the experiments is true.

Table 13.3 Life table for motion sickness data from an experiment with vertical movement at twice the frequency and twice the acceleration (0.222 G) of Experiment 1 (Burns, 1984) (Experiment 2)

| Subject number | Survival time (min) | Survival proportion | Standard error |
|---|---|---|---|
| 1 | 5 | 0.964 | 0.034 |
| 2 | 6* | | |
| 3 | 11 | | |
| 4 | 11 | 0.890 | 0.058 |
| 5 | 13 | 0.853 | 0.067 |
| 6 | 24 | 0.816 | 0.073 |
| 7 | 63 | 0.779 | 0.078 |
| 8 | 65 | 0.742 | 0.082 |
| 9 | 69 | | |
| 10 | 69 | 0.668 | 0.086 |
| 11 | 79 | 0.631 | 0.090 |
| 12 | 82 | | |
| 13 | 82 | 0.556 | 0.090 |
| 14 | 102 | 0.519 | 0.093 |
| 15 | 115 | 0.482 | 0.093 |
| 16 | 120* | | |
| 17 | 120* | | |
| .. | .. | | |
| 28 | 120* | | |

*censored observation

Figure 13.4 Times of events and censoring for two different motion sickness experiments, showing the time intervals used for calculating the logrank test. Experiment 1 was described in Table 13.2 and Experiment 2 in Table 13.3.

The logrank test to compare $k$ groups produces for each group an observed ($O$) and an expected ($E$) number of events. These are compared in a familiar way by calculating the sum over the groups of $(O - E)^2/E$, called $X^2$, and comparing the result to a $\chi^2$ distribution with $k - 1$ degrees of freedom.

The motion sickness data give

| | Observed ($O$) | Expected ($E$) |
|---|---|---|
| Experiment 1 | 5 | 8.86 |
| Experiment 2 | 14 | 10.14 |

so that the logrank statistic is

$$X^2 = \frac{(5 - 8.86)^2}{8.86} + \frac{(14 - 10.14)^2}{10.14} = 3.15.$$

Comparing this value to a $\chi^2$ distribution with one degree of freedom gives $P = 0.08$, so there is some evidence to suggest a difference between the results of the two experiments. Figure 13.5 shows that the survival without vomiting was better in experiment 1.
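The interval-by-interval comparison can be sketched directly from the description above. In the following Python illustration the function `logrank` and the 0/1 event coding are my own, the data are those of Tables 13.2 and 13.3, and the expected number in each group at each event time is taken to be the total number of events at that time shared out in proportion to the numbers still at risk. The P value uses the fact that for one degree of freedom the upper tail of the $\chi^2$ distribution equals `erfc(sqrt(x / 2))`, so that line applies to two groups only.

```python
import math

def logrank(groups):
    """Observed and expected events and X^2 for a list of (times, events) groups.

    events holds 1 for an observed event and 0 for a censored time.
    """
    event_times = sorted({t for times, events in groups
                          for t, e in zip(times, events) if e})
    observed = [sum(events) for _, events in groups]
    expected = [0.0] * len(groups)
    for t in event_times:
        # Subjects still under observation at time t, in each group.
        at_risk = [sum(1 for x in times if x >= t) for times, _ in groups]
        # Events at time t, all groups combined.
        events_now = sum(1 for times, events in groups
                         for x, e in zip(times, events) if e and x == t)
        for g, r in enumerate(at_risk):
            expected[g] += events_now * r / sum(at_risk)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, x2

exp1 = ([30, 50, 50, 51, 66, 82, 92] + [120] * 14,
        [1, 1, 0, 1, 0, 1, 1] + [0] * 14)
exp2 = ([5, 6, 11, 11, 13, 24, 63, 65, 69, 69, 79, 82, 82, 102, 115] + [120] * 13,
        [1, 0] + [1] * 13 + [0] * 13)

observed, expected, x2 = logrank([exp1, exp2])
p_value = math.erfc(math.sqrt(x2 / 2))  # valid for one degree of freedom only
print(observed, [round(e, 2) for e in expected], round(x2, 2), round(p_value, 2))
```

The check that the observed and expected numbers have the same total is a useful safeguard in a program just as it is in a hand calculation.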

Note that the observed numbers and the expected numbers have the same total: it is important to check this when performing the calculation by hand. Note too that the quantity $E$ is better thought of as a measure of the extent of exposure of the subjects rather than the expected number of events. The reason is that under some unusual circumstances $E$ can be larger than the sample size.

The logrank test can be used to compare several groups of subjects.


Figure 13.5 Survival curves for data shown in Table 13.2 and Table 13.3.

Often, however, the categories defining those groups will have a natural ordering, and we should examine the more specific possibility of a trend in survival across the groups. We might, for example, wish to compare survival in several age groups, or in relation to stage of disease, or in relation to amount of exposure of some suspected environmental hazard (such as smoking). The method is a simple extension of the standard logrank test.


Figure 13.6 Kaplan-Meier curves for patients with breast cancer with none, 1-3, or more than 3 positive nodes (data from Barnes et al., 1988).

Figure 13.6 shows survival curves for three groups of women operated on for breast cancer, classified by the number of positive nodes found. An ordinary logrank test gives on 2 degrees of freedom . Because the groups are ordered, however, the trend test should be used, which gives on 1 degree of freedom . There is thus a significant (negative) association between survival and number of positive nodes.

The logrank test can also be extended to allow an adjustment to be made for other variables. For example, in a randomized trial to compare survival in groups of breast cancer patients given different types of surgery we may wish to allow for the stage of breast cancer in the analysis, or for some other prognostic variable. In this stratified analysis, the subjects are divided into subgroups according to the prognostic variable (stage of cancer) and the values of $O$ and $E$ are calculated for each treatment group within each stratum (subgroup). For each treatment group the values of $O$ and $E$ from each stratum are added up and then these sums are compared using the usual logrank formula to get $X^2$. If, by chance, one treatment group includes more subjects with a poor prognosis this stratified analysis will adjust for the imbalance. The same method can be used to combine data from different centres in a multicentre study. There is further discussion of the need to make adjusted comparisons in Chapter 15.

The method for performing the logrank test is shown in detail in section 13.4, which also gives a rather more accurate formula for the logrank statistic $X^2$. The test for trend and stratified analysis are also described. Several computer programs can perform the logrank analysis, which is tedious by hand except for very small data sets, but they do not all give enough information in their output of results (see section 13.8). Peto et al. (1977) give detailed discussion of all the methods discussed in this section, and much else besides; their paper is essential reading.
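The stratified analysis described above, in which $O$ and $E$ are calculated within each stratum and then added up for each treatment group, might be sketched as follows. The survival times here are entirely hypothetical, invented only to show the mechanics, and the function names are my own; the simple $(O - E)^2/E$ formula is used rather than the more accurate one of section 13.4.

```python
def observed_expected(groups):
    """Logrank observed and expected numbers of events within one stratum."""
    event_times = sorted({t for times, events in groups
                          for t, e in zip(times, events) if e})
    observed = [sum(events) for _, events in groups]
    expected = [0.0] * len(groups)
    for t in event_times:
        at_risk = [sum(1 for x in times if x >= t) for times, _ in groups]
        events_now = sum(1 for times, events in groups
                         for x, e in zip(times, events) if e and x == t)
        for g, r in enumerate(at_risk):
            expected[g] += events_now * r / sum(at_risk)
    return observed, expected

def stratified_logrank(strata):
    """Add O and E over strata for each treatment group, then form X^2."""
    n_groups = len(strata[0])
    total_o = [0] * n_groups
    total_e = [0.0] * n_groups
    for groups in strata:
        o, e = observed_expected(groups)
        for g in range(n_groups):
            total_o[g] += o[g]
            total_e[g] += e[g]
    x2 = sum((o - e) ** 2 / e for o, e in zip(total_o, total_e))
    return total_o, total_e, x2

# HYPOTHETICAL data: (times in months, 1 = died / 0 = censored) for
# treatments A and B within two strata defined by stage of disease.
early_stage = [([6, 12, 18, 24], [1, 0, 1, 0]),   # treatment A
               ([4, 10, 20, 24], [1, 1, 0, 0])]   # treatment B
late_stage = [([2, 5, 9], [1, 1, 0]),             # treatment A
              ([1, 3, 4, 8], [1, 1, 1, 1])]       # treatment B

total_o, total_e, x2 = stratified_logrank([early_stage, late_stage])
print(total_o, [round(e, 2) for e in total_e], round(x2, 2))
```

Because subjects are only ever compared with others in the same stratum, an imbalance in prognosis between the treatment groups does not distort the expected numbers.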

13.3.2 The hazard ratio

The logrank test is very widely used for comparing survival in two or more groups, but it is solely a hypothesis test. It provides no direct information of how different the groups were.

One way to measure the relative survival in two groups is to compare the observed number of events with the expected numbers. The ratio $O_1/E_1$ gives the observed event rate in the first group as a proportion of that expected if the null hypothesis were true, and so the ratio

$$\frac{O_1/E_1}{O_2/E_2}$$

gives an estimate of the relative event rates in the two groups. This ratio is also called the hazard ratio. For the motion sickness data, using the observed and expected numbers from the logrank calculation, we have

$$\frac{5/8.86}{14/10.14} = 0.41,$$

so that the estimated relative risk or hazard of vomiting under the conditions of experiment 1 is 0.41 of that for experiment 2.
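As a check on the arithmetic, the ratio can be computed in a line or two. The observed numbers (5 and 14 events) come from the text; the expected numbers 8.86 and 10.14 are an assumption here, being the values I obtain by applying the logrank calculation to Tables 13.2 and 13.3. The function name `hazard_ratio` is mine.

```python
def hazard_ratio(o1, e1, o2, e2):
    """Ratio of observed-to-expected event rates, group 1 relative to group 2."""
    return (o1 / e1) / (o2 / e2)

# Expected numbers are assumed values from the logrank calculation (rounded).
ratio = hazard_ratio(5, 8.86, 14, 10.14)
print(round(ratio, 2))
```

Swapping the groups gives the reciprocal, so it is worth stating explicitly which group is the numerator when reporting the result.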

We can calculate an approximate confidence interval for the hazard ratio, as described in section 13.4.5. In this case the 95% confidence interval is from 0.18 to 1.08, and thus includes the value of 1 corresponding to equal hazards. As we should expect from this small sample, the confidence interval is very wide. Sample size and power are discussed in section 13.7.

The calculation of the relative hazard in the two groups is based on the complete period studied. It is not necessarily true that the relative hazard stays much the same in the two groups throughout that period. Indeed it is quite likely that it will vary, in which case the hazard ratio will not apply throughout the period studied. The plot of survival curves will give a visual impression of the consistency of the effect and is an essential component of the analysis of survival data. With large samples we can calculate the hazards in each group, and thus the hazard ratio, for each of several time periods, and examine the consistency of the hazard ratio over time.

13.3.3 Comparison of survival probabilities

Just as we can obtain a confidence interval for a survival probability calculated from a single group of individuals, so we can calculate a confidence interval for the difference between the survival probabilities calculated from two groups of individuals. The method for calculating such a confidence interval is given in section 13.4.6.

For example, we can calculate the confidence interval for the difference between the estimated probabilities of surviving 60 minutes without being sick for the two experiments already described. The two survival probabilities, as shown in Tables 13.2 and 13.3, are 0.855 and 0.816. The difference is \(0.855 - 0.816 = 0.039\), and the 95% confidence interval is from \(-0.17\) to 0.25.

The main disadvantage of this method is that the confidence interval applies only to one time point. To be valid, that time point must be chosen in advance of seeing the data - it is wrong to choose the time from an inspection of the survival curves. It is possible to calculate confidence intervals for several (or even all) times, but there is no easy way to interpret the results. Unless there is a prior reason for comparing survival proportions at a particular time point it is probably better to use the hazard ratio to derive an estimate of the difference in survival between two groups. In any case, the hazard ratio is a more natural way of comparing survival. Another option is to calculate the ratio of the median survival times; this method is described in section 13.4.7.

13.4 MATHEMATICAL CALCULATIONS AND WORKED EXAMPLE

(This section can be omitted without loss of continuity.)

Most statistical computer programs do not include methods for analysing survival times. Further, those that do cannot perform all of the calculations described in sections 13.2 and 13.3, especially those needed to produce confidence intervals. The methods are not mathematically complex, but they can be somewhat fiddly.

13.4.1 Survival curve (Kaplan-Meier)

The principle behind the calculation of survival probabilities was outlined in section 13.2. The proportion surviving a given length of time, say 100 days, is calculated by multiplying the probabilities of surviving each day up to that time. We need only consider days on which there is an event or 'failure' (e.g. death). If there is a death at 100 days, then we estimate the proportion surviving 100 days as the proportion surviving 99 days multiplied by the proportion of those surviving 99 days who also survive 100 days. If \(p_k\) is the probability of surviving \(k\) days, \(r_k\) is the number of subjects still at risk (i.e. still being followed up) immediately before the \(k\)th day, and \(f_k\) is the number of observed failures on day \(k\), then we have

\[ p_k = p_{k-1} \times \frac{r_k - f_k}{r_k}. \]

This is a mathematical representation of the statement in the previous sentence.

For the data in Table 13.2 the time unit is minutes, and a 'failure' was vomiting. The proportion surviving without vomiting is 1 up to 29 minutes. We therefore have \(p_{29} = 1\), and \(r_{30} = 21\) because all subjects are still at risk at 30 minutes. There was one failure at 30 minutes, so \(f_{30} = 1\) and we can calculate the proportion surviving 30 minutes as

\[ p_{30} = p_{29} \times \frac{r_{30} - f_{30}}{r_{30}} = 1 \times \frac{21 - 1}{21} = 0.9524 \]

as shown in Table 13.2. The estimated proportion surviving stays the same until the next failure time, which is 50 minutes. We assume that subject 3 who was censored at the same minute was still at risk at the time when subject 2 'failed', so we have

\[ p_{50} = p_{49} \times \frac{r_{50} - f_{50}}{r_{50}} = 0.9524 \times \frac{20 - 1}{20} = 0.9048 \]

because there were only 20 subjects still at risk at 50 minutes. One subject withdrew at 50 minutes so their time was censored, and the number at risk at 51 minutes was thus only 18. The calculations for the complete set of data are shown in Table 13.4. The column of interest, the survival proportion, is simply the product of all the entries in the previous column from the top of the table down to that row. Note that the only effect of the censored observations is to alter the number at risk at the next uncensored survival time.

Table 13.4 Calculation of survival probabilities (Kaplan-Meier survival curve) for data in Table 13.2

| Subject number (k) | Survival time (min) | Number at risk (r_k) | Observed failures (f_k) | (r_k - f_k)/r_k | Survival proportion (p_k) |
|---|---|---|---|---|---|
| 1 | 30 | 21 | 1 | 0.9524 | 0.9524 |
| 2 | 50 | 20 | 1 | 0.9500 | 0.9048 |
| 3 | 50* | | | | |
| 4 | 51 | 18 | 1 | 0.9444 | 0.8545 |
| 5 | 66* | | | | |
| 6 | 82 | 16 | 1 | 0.9375 | 0.8011 |
| 7 | 92 | 15 | 1 | 0.9333 | 0.7476 |
| 8 | 120* | | | | |
| 9 | 120* | | | | |
| . | . | | | | |
| . | . | | | | |
| 21 | 120* | | | | |

* censored observation
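The product rule above is easy to check by machine. The following is a minimal Python sketch, not the book's own program: it rebuilds the rows of Table 13.4 from the 21 survival times of experiment 1. The function name `kaplan_meier` and the data layout are illustrative choices. A subject censored at a failure time is counted as still at risk at that time, as assumed in the text.

```python
# Experiment 1 of the motion sickness data (Table 13.4): 21 subjects.
# Each entry is (time in minutes, True if the subject vomited, False if censored).
experiment_1 = (
    [(30, True), (50, True), (50, False), (51, True), (66, False),
     (82, True), (92, True)]
    + [(120, False)] * 14
)


def kaplan_meier(subjects):
    """Return one row (time, r, f, p) for each time with at least one failure."""
    rows = []
    p = 1.0  # proportion surviving before the first failure
    for t in sorted({time for time, failed in subjects if failed}):
        # r: still at risk immediately before time t (censored at t counts as at risk)
        r = sum(1 for time, _ in subjects if time >= t)
        # f: failures observed at time t
        f = sum(1 for time, failed in subjects if time == t and failed)
        p *= (r - f) / r  # p_k = p_(k-1) * (r_k - f_k) / r_k
        rows.append((t, r, f, p))
    return rows


km_rows = kaplan_meier(experiment_1)
for t, r, f, p in km_rows:
    print(f"{t:4d} {r:3d} {f:2d} {(r - f) / r:.4f} {p:.4f}")
```

The printed columns correspond to survival time, number at risk, failures, conditional proportion surviving and cumulative survival proportion. The last digit can differ from the table because the sketch does not round intermediate results.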

The standard error of the survival proportion can be calculated in various ways, although the different formulae give very similar results. A simple formula is

\[ \mathrm{SE}(p_k) = p_k \sqrt{\frac{1 - p_k}{r_k}} \]

where \(p_k\) is the estimated proportion surviving at time \(k\). The standard errors in Tables 13.2 and 13.3 were calculated using this formula. On the assumption that \(p_k\) will have an approximately Normal sampling distribution we can calculate a 95% confidence interval for \(p_k\) as

\[ p_k - 1.96\,\mathrm{SE}(p_k) \quad \text{to} \quad p_k + 1.96\,\mathrm{SE}(p_k). \]

This is not a good approximation for small sample sizes or for very large or small probabilities, say outside the range 0.2 to 0.8, under which circumstances the confidence interval can go outside the range 0 to 1. While the confidence interval can be curtailed at the limit (e.g. change the range '0.75 to 1.10' to '0.75 to 1.0') this is an indication of an inadequate amount of data.

There are many alternative formulae for the standard error of an estimated survival probability, the best known being due to Greenwood:

\[ \mathrm{SE}(p_k) = p_k \sqrt{\sum_{j \le k} \frac{f_j}{r_j (r_j - f_j)}} \]

where the sum is taken over all failure times up to and including time \(k\).

Computer programs are likely to use a more accurate formula than the one used in the example. Tables 13.2 and 13.3 show how the standard errors for the motion sickness data increase as the number still at risk falls, as we would expect in general.
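As an illustration of Greenwood's formula and of curtailing the interval at 0 and 1, here is a short Python sketch under stated assumptions. The numbers at risk and failures are those of Table 13.4. The function name `greenwood_intervals` is hypothetical, and the resulting standard errors need not agree exactly with those printed in Table 13.2, which the text says were obtained with the simpler formula.

```python
import math

# (time in minutes, number at risk r, failures f) at each failure time, Table 13.4
failure_rows = [(30, 21, 1), (50, 20, 1), (51, 18, 1), (82, 16, 1), (92, 15, 1)]


def greenwood_intervals(rows, z=1.96):
    """Return (time, p, SE, lower, upper) using Greenwood's standard error."""
    results = []
    p = 1.0
    running_sum = 0.0  # accumulates f / (r * (r - f)) over the failure times
    for t, r, f in rows:
        p *= (r - f) / r
        running_sum += f / (r * (r - f))
        se = p * math.sqrt(running_sum)
        lower = max(0.0, p - z * se)  # curtail the interval at 0 ...
        upper = min(1.0, p + z * se)  # ... and at 1
        results.append((t, p, se, lower, upper))
    return results


greenwood_rows = greenwood_intervals(failure_rows)
for t, p, se, lower, upper in greenwood_rows:
    print(f"{t:4d}  p={p:.4f}  SE={se:.4f}  95% CI {lower:.3f} to {upper:.3f}")
```

At 30 minutes the uncurtailed upper limit exceeds 1, so it is cut back to 1.0, which is the sign of an inadequate amount of data described above.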

13.4.2 The logrank test

The logrank test of the null hypothesis of the same survival experience in two or more groups of subjects involves calculating the observed and expected numbers of failures in separate time intervals, and summing these. The method is illustrated using the two groups of observations shown in Tables 13.2 and 13.3.

As shown in Figure 13.4, the time span of the study is divided into time intervals ending with one or more failures, although this is equivalent to considering only the minutes of failures, as for the calculation of survival probabilities. For each minute with a failure we calculate the numbers at risk in each group (\(r_1\) and \(r_2\)) and the numbers of observed failures (\(f_1\) and \(f_2\)). From these we calculate the expected number of failures in each group assuming the null hypothesis is true. At each time we have a \(2 \times 2\) table as follows:

| | Group 1 | Group 2 | Total |
|---|---|---|---|
| Failures | \(f_1\) | \(f_2\) | \(f\) |
| Not failures | \(r_1 - f_1\) | \(r_2 - f_2\) | \(r - f\) |
| Total | \(r_1\) | \(r_2\) | \(r\) |

We calculate expected numbers of failures as in Chapter 10, so that \(e_1 = r_1 f / r\) and \(e_2 = r_2 f / r\). We then sum the observed and expected values for the whole table to get \(O_1 = \sum f_1\), \(E_1 = \sum e_1\), etc. Note that \(O_1 + O_2 = E_1 + E_2\), an equivalence that should be verified for hand calculations. The simplest way to calculate the logrank test statistic is by

\[ X^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2}. \]

However, a slightly better answer can be obtained by calculating the variance of \(f_1 - e_1\) at each time as

\[ v = \frac{r_1 r_2 f (r - f)}{r^2 (r - 1)} \]

and summing these values overall to get \(V = \sum v\). The alternative form of the test statistic is given by

\[ X^2 = \frac{(O_1 - E_1)^2}{V}. \]

In practice the two methods usually give similar answers.

The calculations for the motion sickness data are shown in Table 13.5. There were two failures at 11 and 69 minutes and three at 82 minutes, so we will not get the same answer using the two versions of the logrank test.

Table 13.5 Calculating the logrank test statistic for the motion sickness data. The subscripts refer to Experiments 1 and 2

| Time (mins) | \(r_1\) | \(r_2\) | \(r\) | \(f_1\) | \(f_2\) | \(f\) | \(e_1 = r_1 f/r\) | \(f_1 - e_1\) | \(v = r_1 r_2 f(r - f)/r^2(r - 1)\) |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 21 | 28 | 49 | 0 | 1 | 1 | 0.4286 | -0.4286 | 0.2449 |
| 6* | 21 | 27 | 48 | | | | | | |
| 11 | 21 | 26 | 47 | 0 | 2 | 2 | 0.8936 | -0.8936 | 0.4836 |
| 13 | 21 | 24 | 45 | 0 | 1 | 1 | 0.4667 | -0.4667 | 0.2489 |
| 24 | 21 | 23 | 44 | 0 | 1 | 1 | 0.4773 | -0.4773 | 0.2495 |
| 30 | 21 | 22 | 43 | 1 | 0 | 1 | 0.4884 | 0.5116 | 0.2499 |
| 50 | 20 | 22 | 42 | 1 | 0 | 1 | 0.4762 | 0.5238 | 0.2494 |
| 50* | 19 | 22 | 41 | | | | | | |
| 51 | 18 | 22 | 40 | 1 | 0 | 1 | 0.4500 | 0.5500 | 0.2475 |
| 63 | 17 | 22 | 39 | 0 | 1 | 1 | 0.4359 | -0.4359 | 0.2459 |
| 65 | 17 | 21 | 38 | 0 | 1 | 1 | 0.4474 | -0.4474 | 0.2472 |
| 66* | 16 | 21 | 37 | | | | | | |
| 69 | 16 | 20 | 36 | 0 | 2 | 2 | 0.8889 | -0.8889 | 0.4797 |
| 79 | 16 | 18 | 34 | 0 | 1 | 1 | 0.4706 | -0.4706 | 0.2491 |
| 82 | 16 | 17 | 33 | 1 | 2 | 3 | 1.4545 | -0.4545 | 0.7025 |
| 92 | 15 | 15 | 30 | 1 | 0 | 1 | 0.5000 | 0.5000 | 0.2500 |
| 102 | 14 | 15 | 29 | 0 | 1 | 1 | 0.4828 | -0.4828 | 0.2497 |
| 115 | 14 | 14 | 28 | 0 | 1 | 1 | 0.5000 | -0.5000 | 0.2500 |
| Total | | | | 5 | 14 | 19 | 8.8607 | -3.8607 | 4.6478 |
| | | | | \(O_1\) | \(O_2\) | | \(E_1\) | \(O_1 - E_1\) | \(V\) |

* censored observation

NB: \(E_2 = O_1 + O_2 - E_1 = 19 - 8.8607 = 10.1393\).

The first method gives

\[ X^2 = \frac{(5 - 8.8607)^2}{8.8607} + \frac{(14 - 10.1393)^2}{10.1393} = 3.15 \]

while the second, more precise, method gives

\[ X^2 = \frac{(-3.8607)^2}{4.6478} = 3.21. \]

There is clearly a negligible difference here, and in general the first formula for the statistic \(X^2\) will be satisfactory. It has the advantage of not requiring the calculation of the rather complicated variances.

Under the null hypothesis the statistic \(X^2\) has a \(\chi^2\) distribution with \(k - 1\) degrees of freedom when there are \(k\) groups of observations. Thus for the example we should compare the calculated value of \(X^2\) with a \(\chi^2\) distribution with 1 degree of freedom, which gives \(P = 0.08\) for \(X^2 = 3.15\) (and \(P = 0.07\) for \(X^2 = 3.21\)).
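The whole of Table 13.5 can be regenerated from the raw survival times. The Python sketch below is one possible way of doing so and is not taken from the book. The times for experiment 2 are read off Table 13.5, and it is assumed that the 13 subjects remaining after 115 minutes were censored at 120 minutes; this assumption does not affect the logrank calculation. The function name `logrank_two_groups` is illustrative.

```python
import math

# (time in minutes, 1 = vomited, 0 = censored)
experiment_1 = ([(30, 1), (50, 1), (50, 0), (51, 1), (66, 0), (82, 1), (92, 1)]
                + [(120, 0)] * 14)
# Assumption: the 13 subjects still at risk after 115 minutes were censored at 120.
experiment_2 = ([(5, 1), (6, 0), (11, 1), (11, 1), (13, 1), (24, 1), (63, 1),
                 (65, 1), (69, 1), (69, 1), (79, 1), (82, 1), (82, 1),
                 (102, 1), (115, 1)]
                + [(120, 0)] * 13)


def logrank_two_groups(group_1, group_2):
    """Sum observed and expected failures and the variance over all failure times."""
    O1 = O2 = 0
    E1 = V = 0.0
    for t in sorted({u for u, failed in group_1 + group_2 if failed}):
        r1 = sum(1 for u, _ in group_1 if u >= t)   # at risk in group 1
        r2 = sum(1 for u, _ in group_2 if u >= t)   # at risk in group 2
        f1 = sum(d for u, d in group_1 if u == t)   # failures in group 1
        f2 = sum(d for u, d in group_2 if u == t)   # failures in group 2
        r, f = r1 + r2, f1 + f2
        O1 += f1
        O2 += f2
        E1 += r1 * f / r
        if r > 1:
            V += r1 * r2 * f * (r - f) / (r ** 2 * (r - 1))
    E2 = O1 + O2 - E1  # because O1 + O2 = E1 + E2
    return {
        "O1": O1, "O2": O2, "E1": E1, "E2": E2, "V": V,
        "x2_simple": (O1 - E1) ** 2 / E1 + (O2 - E2) ** 2 / E2,
        "x2_variance": (O1 - E1) ** 2 / V,
    }


result = logrank_two_groups(experiment_1, experiment_2)
# Upper tail of the chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(result["x2_simple"] / 2))
print(result, round(p_value, 3))
```

The identity used for the P value holds only for 1 degree of freedom; with more groups a general \(\chi^2\) routine would be needed.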

The logrank test can be carried out with more than two sets of data. The statistic \(X^2\) is calculated using an extension of the first equation above with a term for each group. If we have \(k\) groups we have

\[ X^2 = \sum_{g=1}^{k} \frac{(O_g - E_g)^2}{E_g}. \]

The value of \(X^2\) is compared with a \(\chi^2\) distribution with \(k - 1\) degrees of freedom. If there is a natural ordering of the groups, however, then a test for trend should be performed, as described below.

13.4.3 The logrank test for trend

With three or more ordered groups, a more appropriate test is to consider the possibility that there is a trend in survival across the groups. We may, for example, wish to compare age groups, or patients with different stages of cancer. This test is also appropriate for studying the possible effect of continuous variables which have been separated into three or more groups. The analysis is similar in principle to the Chi squared test for trend for a \(2 \times k\) frequency table, described in section 10.8.2.

Using the method given in the previous section, we can obtain \(O_g\) and \(E_g\) for each group, where \(g\) denotes the group's number (\(g = 1, \ldots, k\)). If we give a code \(a_g\) to each group (not necessarily equally spaced), then we can calculate for each group

\[ A_g = a_g (O_g - E_g), \qquad B_g = a_g E_g, \qquad C_g = a_g^2 E_g. \]

The test statistic for trend is obtained as

\[ X^2_{\mathrm{trend}} = \frac{\left(\sum A_g\right)^2}{V_T} \]

where

\[ V_T = \sum C_g - \frac{\left(\sum B_g\right)^2}{\sum E_g}. \]

The test statistic \(X^2_{\mathrm{trend}}\) is compared with the \(\chi^2\) distribution with one degree of freedom, however many groups are being analysed. Note that the statistic \(X^2_{\mathrm{trend}}\) must lie between zero and the usual logrank statistic \(X^2\) which is used to evaluate general heterogeneity among the groups. Again the method is purely a hypothesis test.

An example is given by the survival data from 195 women with breast cancer shown in Figure 13.6. Women were divided into three groups according to whether they had no positive nodes, a few (1-3) or many (more than 3). The values of \(O_g\) and \(E_g\) for each group were as follows:

| Positive nodes | Number of women | Number of deaths (\(O_g\)) | Expected (\(E_g\)) | \(O_g - E_g\) |
|---|---|---|---|---|
| none | 102 | 38 | 46.41 | -8.41 |
| few (1-3) | 58 | 26 | 25.21 | 0.79 |
| many (> 3) | 35 | 22 | 14.38 | 7.62 |

The usual logrank test on these data yields \(X^2 = 5.59\) on 2 degrees of freedom (\(P = 0.06\)). However, the groups are ordered so the logrank test for trend should be used. If we give the groups codes of \(-1\), 0 and 1, we get the following:

| Positive nodes | \(A_g\) | \(B_g\) | \(C_g\) |
|---|---|---|---|
| none | 8.41 | -46.41 | 46.41 |
| few | 0.00 | 0.00 | 0.00 |
| many | 7.62 | 14.38 | 14.38 |
| Total | 16.03 | -32.03 | 60.79 |

(Note how the above choice of codes simplifies the arithmetic.)

From these values we can calculate \(V_T = 60.79 - (-32.03)^2/86.00 = 48.86\) and \(X^2_{\mathrm{trend}} = 16.03^2/48.86 = 5.26\). Thus almost all of the variation among the groups can be attributed to a trend; the statistic \(X^2_{\mathrm{trend}}\) is compared with the Chi squared distribution with one degree of freedom, giving \(P = 0.02\).
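The trend calculation for the breast cancer data can be reproduced in a few lines. This Python sketch simply applies the definitions of \(A_g\), \(B_g\) and \(C_g\) given above to the tabulated \(O_g\) and \(E_g\) with codes \(-1\), 0 and 1. The variable names are illustrative, and the P value uses an identity that is valid only for 1 degree of freedom.

```python
import math

# (code a_g, observed deaths O_g, expected deaths E_g) for none / few / many nodes
groups = [(-1, 38, 46.41), (0, 26, 25.21), (1, 22, 14.38)]

sum_A = sum(a * (O - E) for a, O, E in groups)   # sum of a_g (O_g - E_g)
sum_B = sum(a * E for a, O, E in groups)         # sum of a_g E_g
sum_C = sum(a * a * E for a, O, E in groups)     # sum of a_g^2 E_g
sum_E = sum(E for a, O, E in groups)

V_T = sum_C - sum_B ** 2 / sum_E
x2_trend = sum_A ** 2 / V_T

# Usual logrank statistic for general heterogeneity (2 degrees of freedom here)
x2_heterogeneity = sum((O - E) ** 2 / E for a, O, E in groups)

# Upper tail of the chi-squared distribution with 1 degree of freedom
p_trend = math.erfc(math.sqrt(x2_trend / 2))

print(round(V_T, 2), round(x2_trend, 2), round(x2_heterogeneity, 2), round(p_trend, 3))
```

The trend statistic should always fall between zero and the heterogeneity statistic, which gives a quick check on hand calculations.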

13.4.4 Stratified logrank test

We can combine data for subsets of subjects to get a more sensitive comparison of the groups of main interest. For example, if we are interested in comparing two groups given different treatments we may wish to stratify by age or some other prognostic variable, especially if the numbers of high risk subjects differ between the groups. The effect of stratification here is much the same as adjusting for other variables in a multiple regression analysis (see section 12.4). The same method can be used to combine data from independent trials of the same treatments. In either case the stratified analysis will be more reliable than an analysis simply pooling all the data.

The stratified logrank test is very simple. If we have two groups of subjects, then for each subgroup (stratum) of interest we calculate \(O_1\), \(O_2\), \(E_1\) and \(E_2\). These are then summed over all strata and the logrank statistic calculated as

\[ X^2 = \frac{\left(\sum O_1 - \sum E_1\right)^2}{\sum E_1} + \frac{\left(\sum O_2 - \sum E_2\right)^2}{\sum E_2}. \]

If the null hypothesis is true the statistic \(X^2\) has a \(\chi^2\) distribution with \(k - 1\) degrees of freedom, where there are \(k\) groups.

13.4.5 The hazard ratio

As noted in section 13.3.2, the relative survival experience of two groups can be expressed as

\[ R = \frac{O_1/E_1}{O_2/E_2} \]

which is termed the hazard ratio. We can calculate an approximate confidence interval for \(\log_e R\) and so obtain a confidence interval for \(R\) (Simon, 1986). We use the variance \(V\) derived from the second formula given in section 13.4.2 and calculate

\[ K = \frac{O_1 - E_1}{V} \]

which is an estimate of the log hazard ratio (and will be similar to the log of the observed hazard ratio). The standard error of this estimate is approximately \(1/\sqrt{V}\), so a 95% confidence interval for \(\log_e R\) is given by \(K - 1.96/\sqrt{V}\) to \(K + 1.96/\sqrt{V}\). A 95% confidence interval for \(R\) is thus obtained easily by antilogging these values.

For the motion sickness data we had

\[ O_1 - E_1 = -3.8607 \]

and

\[ V = 4.6478 \]

so we have

\[ K = \frac{-3.8607}{4.6478} = -0.831 \quad \text{and} \quad \frac{1}{\sqrt{V}} = 0.464. \]

The 95% confidence interval for \(\log_e R\) is from \(-0.831 - 1.96 \times 0.464 = -1.740\) to \(-0.831 + 1.96 \times 0.464 = 0.078\). The 95% confidence interval for the hazard ratio is thus from \(e^{-1.740}\) to \(e^{0.078}\), that is from 0.18 to 1.08.
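The arithmetic for the hazard ratio and its approximate confidence interval can be sketched in Python as follows, starting from the totals of Table 13.5. This is only an illustration of the steps in this section; the variable names are not from the book.

```python
import math

# Totals from Table 13.5 for the motion sickness data
O1, O2 = 5, 14
E1 = 8.8607
V = 4.6478
E2 = O1 + O2 - E1                 # expected failures in group 2

R = (O1 / E1) / (O2 / E2)         # observed hazard ratio
K = (O1 - E1) / V                 # estimate of the log hazard ratio
se_K = 1 / math.sqrt(V)           # approximate standard error of K

log_lower = K - 1.96 * se_K
log_upper = K + 1.96 * se_K
ci_lower = math.exp(log_lower)    # antilog to return to the hazard ratio scale
ci_upper = math.exp(log_upper)

print(f"R = {R:.2f}, K = {K:.3f}, SE = {se_K:.3f}")
print(f"95% CI for R: {ci_lower:.2f} to {ci_upper:.2f}")
```

Note that \(e^K\) is close to, but not identical with, the observed ratio \(R\), as the text indicates.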

13.4.6 Comparison of survival probabilities

Using the method given in section 13.4.1 we can estimate the survival probability and its standard error at some time point separately for two independent groups of individuals, say \(p_1\), \(\mathrm{SE}(p_1)\), \(p_2\) and \(\mathrm{SE}(p_2)\). The standard error of \(p_1 - p_2\) is, as usual, given by

\[ \mathrm{SE}(p_1 - p_2) = \sqrt{\mathrm{SE}(p_1)^2 + \mathrm{SE}(p_2)^2}. \]

A 95% confidence interval for the difference in survival proportions is thus given by

\[ p_1 - p_2 - 1.96\,\mathrm{SE}(p_1 - p_2) \quad \text{to} \quad p_1 - p_2 + 1.96\,\mathrm{SE}(p_1 - p_2). \]

For example, we can compare the survival proportion at 60 minutes in the two motion sickness experiments. We have \(p_1 = 0.855\) and \(p_2 = 0.816\), with standard errors as given in Tables 13.2 and 13.3, so

\[ p_1 - p_2 = 0.855 - 0.816 = 0.039 \]

and thus the 95% confidence interval for \(p_1 - p_2\) at 60 minutes is

\[ 0.039 \pm 1.96\,\mathrm{SE}(p_1 - p_2) \]

that is, from \(-0.17\) to 0.25. There is little apparent difference between the two sets of data at 60 minutes, although the logrank test showed some evidence of a difference overall.

13.4.7 Comparing median survival times

As I observed earlier, it is easy to derive an estimate of the median survival time from the Kaplan-Meier survival curve. Simon (1986) gives a method for calculating a confidence interval for the median survival time.

Simon also gives the following simple method for calculating an approximate confidence interval for the ratio of two independent estimated median survival times. If \(M_1\) and \(M_2\) are the median survival times of two independent samples, the approximate 95% confidence interval for \(M_1/M_2\) is

\[ \frac{M_1}{M_2}\, e^{-1.96 s} \quad \text{to} \quad \frac{M_1}{M_2}\, e^{1.96 s} \]

where

\[ s = \sqrt{\frac{1}{O_1} + \frac{1}{O_2}} \]

and \(O_1\) and \(O_2\) are the numbers of observed events in the two samples.

The method assumes that the failure times have an exponential distribution; a quick check of this assumption is to see if each calculated median is similar to that expected if the assumption is true, namely the sum of the survival times (whether censored or not) divided by 1.44 times the number of events. For example, the observed median for the data in Table 13.3 is 115 minutes whereas the expected value if the distribution was exponential is about 117 minutes. We cannot compare the medians for the two motion sickness experiments, however, as we have no estimated median for the data in Table 13.2.

13.4.8 Comment

The most important part of survival analysis is to produce a plot of the survival curves for each group of interest, but assessment of possible differences should be based on statistical analysis. The logrank test is the most common form of statistical analysis, but it is a hypothesis test and yields no estimate of relative survival. None of the estimates proposed is without problems, but the hazard ratio is the most appealing as long as the curves suggest that the relative survival rates do not vary greatly over time. This would not be so, for example, for survival curves that crossed. The hazard ratio also gives a link with the more complex regression approach to the analysis of survival data, described in section 13.6, where an important assumption is that the hazard ratio is constant over time.

An assumption of all survival analyses is that there is no information in the times of censored observations. In the motion sickness example, we may question whether those individuals who requested an early stop to the experiments would have been near to being sick. There is a case here for regarding an early stop as a failure rather than as a censored observation.

13.5 INCORRECT ANALYSES

Peto et al. (1977) describe several incorrect approaches to the analysis of survival data, some of which are discussed below. Some others relate to clinical trials in general and are discussed in Chapter 15. I also explain why it is invalid to compare the survival of those who do or do not respond to treatment.

13.5.1 Summarizing survival

A common error is to summarize survival by the proportion of subjects still alive (or whatever) at some suitable time after the start of the study. For example, in a study of a beta-blocking drug given to men who had suffered a myocardial infarction (MI) (heart attack) we could calculate the proportion who had had another MI within a year of being on the drug. Apart from the arbitrary choice of one year, such an analysis ignores information about exactly how long the subjects survived without another attack and it will give a biased answer if, as is likely, not all subjects were followed up for a year. An even worse approach is to calculate the mean survival time, as this cannot provide a sensible answer when some of the survival times are censored.

The calculation of the median survival time is sensible, but it must be derived from the Kaplan-Meier curve, and not from the raw data unless there are no censored observations. The median survival time can easily be read from the plotted survival curve, being the time corresponding to a survival proportion of 0.5. Unfortunately, the sample median cannot be calculated unless the survival curve drops below 0.5, and even if it does it is an imprecise estimate of the median survival time in the population except in large samples.

13.5.2 Survival curves

The survival curve should be drawn as a 'step function' as in Figures 13.3 to 13.6; it is incorrect simply to join the estimated survival probabilities at each time of death with sloping lines.

Mistaken interpretation of survival curves often involves over-interpretation of the right-hand part of the curve. It is common for survival curves to flatten out after a while, as events become less frequent. It is unwise to interpret this flattening as meaningful unless there are many subjects still at risk. In contrast, if the last death occurs after the last censored time, not a rare occurrence, the survival curve will plunge to zero. We should not take this as an indication that nobody will survive beyond that time. When two survival curves are compared there is frequently a larger gap between the curves at the end of the period under study than at the beginning. This should not of itself be taken as an indication that the curves diverge. All of these situations often occur simply because the tail of the curve is very unstable due to small numbers at risk. There are two simple remedies: always show the numbers at risk at regular time intervals (e.g. every month or year, as appropriate) and curtail the survival curve when there are, say, only five subjects still at risk. The comparison of two survival curves should be based upon the methods already described, especially the logrank test using all the data, not upon visual impression.

This is an appropriate place to repeat the earlier warning about not comparing the proportions surviving a certain period when the time point for the comparison is chosen by inspecting the survival curves. The comparison is only valid if the time was chosen in advance of collecting the data.

13.5.3 Comparing responders and non-responders

In many clinical studies it is possible to categorize patients according to whether or not there is some observed response to treatment. For example, in cancer drug trials it is usual to see if the tumour has responded (shrunk) following treatment. It is then natural to wish to compare the survival of responders and non-responders. Unfortunately, this analysis is not valid (Oye and Shapiro, 1984) because the groups are defined by a factor not known at the start of treatment. The analysis is biased because the responders must have survived for a certain period in order to achieve a response. Also, the patients who respond may have been more likely to survive longer even if not treated. The fact that responders survive longer does not mean that the treatment is useful. Some cancer journals have specifically banned this type of analysis.

A better approach is to compare the survival of non-responders from the start of treatment with that of responders from the time of response. This analysis too may give misleading results, however (Simon and Makuch, 1984). Expert statistical advice is strongly recommended if this type of analysis is contemplated.

13.5.4 Multiple comparisons

As with other simple analyses (such as the \(t\) test and correlation) the logrank test should be used with care when we wish to explore the relation of numerous variables to survival. While it is useful to see which variables seem to be associated with a better prognosis, these variables are likely to be correlated with each other too. Also, on average one variable in 20 will be significant at the 5% level, and thus appear important, just by chance. A better approach, therefore, is one that is analogous to multiple regression analysis; such an approach is described in the next section.

13.6 MODELLING SURVIVAL - THE COX REGRESSION MODEL

(This section is more complex than the others in the book.)

The logrank test is a non-parametric method for comparing the survival experience of two or more groups. It cannot be used to explore the effects of several variables on survival. The regression method introduced by Cox (1972) is used widely when it is desired to investigate several variables at the same time. It is also known as proportional hazards regression analysis.

Cox's method is a 'semi-parametric' approach - no particular type of distribution is assumed for the survival times, but a strong assumption is made that the effects of the different variables on survival are constant over time and are additive in a particular scale. The actual method is too complex for detailed discussion in this book; this section is intended to give an introduction to the ideas of the method, which should help when reading the results of such analyses. There are many potential difficulties when performing Cox regression, and I do not recommend that the method is used by non-statisticians.

The hazard function is closely related to the survival curve, representing the risk of dying in a very short time interval after a given time, assuming survival thus far. It can therefore be interpreted as the risk of dying at time \(t\). Cox's method is equivalent in its capability to multiple regression analysis as described in section 12.4, except that the regression model defines the hazard at a given time. If we have several independent variables of interest, say \(x_1\) to \(x_p\), we can express the hazard at time \(t\), \(h(t)\), as

\[ h(t) = h_0(t) \times \exp(b_1 x_1 + b_2 x_2 + \cdots + b_p x_p). \]

The quantity \(h_0(t)\) in the equation is estimated from the data, and clearly corresponds to the hazard when all the variables are zero (because \(e^0 = 1\)). It is called the baseline or underlying hazard function. The regression coefficients, \(b_1\) to \(b_p\), also have to be estimated. If we have just one variable of interest, such as age, then we have

\[ h(t) = h_0(t) \times \exp(b \times \text{age}), \quad \text{or equivalently} \quad \log_e h(t) = \log_e h_0(t) + b \times \text{age}. \]

Under this model a proportional change in age, such as a 50% increase from 40 to 60 years, results in a proportional change in the log of the hazard. Equivalently, an increase of 20 years multiplies the hazard by \(e^{20b}\) at every time \(t\). In practice the proportional hazards regression model is often found very suitable for modelling survival data, but the assumption of proportional hazards can and should be tested.

The hazard gives the risk of dying at time \(t\), so we can add all the hazards up to time \(t\) to get the risk of dying between time 0 and time \(t\). This is called the cumulative hazard, \(H(t)\). It is defined as

\[ H(t) = H_0(t) \times \exp(b_1 x_1 + b_2 x_2 + \cdots + b_p x_p) \]

where \(H_0(t)\) is the cumulative underlying hazard function. Because of the way \(H(t)\) is calculated it can be shown that the probability of surviving to time \(t\), \(S(t)\), can be estimated by \(\exp[-H(t)]\). We can thus estimate the survival probability for any individual with specific values of the variables in the model.
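To make the relation between the cumulative hazard and the survival probability concrete, here is a minimal Python sketch. It assumes that the cumulative underlying hazard at the time of interest and the regression coefficients are already available from a fitted model. The numerical values used below are purely hypothetical and are not estimates from any data set in this chapter.

```python
import math


def relative_hazard(coefficients, values):
    """exp(b1*x1 + ... + bp*xp): hazard relative to the baseline individual."""
    return math.exp(sum(b * x for b, x in zip(coefficients, values)))


def survival_probability(baseline_cumulative_hazard, coefficients, values):
    """S(t) = exp(-H(t)), where H(t) = H0(t) * exp(b1*x1 + ... + bp*xp)."""
    H_t = baseline_cumulative_hazard * relative_hazard(coefficients, values)
    return math.exp(-H_t)


# Hypothetical inputs: H0(t) = 0.2 at the chosen time, one binary covariate, b = 0.5
H0_t = 0.2
b = [0.5]
s_baseline = survival_probability(H0_t, b, [0])  # covariate = 0: baseline survival
s_exposed = survival_probability(H0_t, b, [1])   # covariate = 1: hazard times e^0.5

print(round(s_baseline, 3), round(s_exposed, 3))
```

With a positive coefficient the individual with the covariate equal to 1 has a higher hazard and therefore a lower estimated survival probability than the baseline individual.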

13.6.1 解释 13.6.1 Interpretation

Cox模型必须使用合适的计算机程序进行拟合。有些程序支持逐步选择变量。Cox回归分析的最终模型将给出风险作为多个协变量函数的方程。我们如何解释结果?
The Cox model must be fitted using an appropriate computer program. Some allow for stepwise selection of variables. The final model from a Cox regression analysis will yield an equation for the hazard as a function of several covariates. How can we interpret the results?

选择纳入模型的变量完全按照第12.4节中描述的方法进行。因此,我假设我们已经获得了一个模型,并希望对其进行解释,特别是针对具有模型中某些变量值(通常称为协变量)的新患者的预后。
The selection of variables for inclusion in the model follows exactly the same lines as described in section 12.4. I shall thus assume that we have obtained a model and wish to interpret it, especially in relation to the prognosis of a new patient with certain values of the variables in the model (often called covariates).

对一项长期随机试验数据进行了Cox回归分析,该试验比较了硫唑嘌呤与安慰剂在治疗原发性胆汁性肝硬化(PBC)患者中的效果。所选模型包含表13.6中显示的六个变量,每个变量至少在5%水平上具有统计学显著性。模型见表13.7。每个变量的近似显著性检验通过将回归估计值除以其标准误,然后与标准正态分布进行比较得到。
Cox regression analysis was performed on the data from a long randomized trial comparing azathioprine and placebo in the treatment of patients with primary biliary cirrhosis (PBC). The chosen model included the six variables shown in Table 13.6, each of which was statistically significant at the 5% level at least. The model is shown in Table 13.7. An approximate test of significance for each variable is obtained by dividing the regression estimate by its standard error and comparing the result with the standard Normal distribution.

表13.6 纳入Cox回归模型的变量,模型拟合于一项临床试验数据,该试验比较硫唑嘌呤与安慰剂对216例原发性胆汁性肝硬化患者生存的影响(Christensen等,1985)。第二列显示了回归分析中使用的变量评分。
Table 13.6 Variables included in Cox regression model fitted to data from a clinical trial comparing the effects of azathioprine and placebo on the survival of 216 patients with primary biliary cirrhosis (Christensen et al., 1985). The second column shows the scoring of the variables used in the regression analysis.

| 变量 | 评分 |
|---|---|
| 血清胆红素 | log10(μmol/l的数值) |
| 年龄 | exp[(年龄(岁)- 20)/10] |
| 肝硬化 | 0 = 否;1 = 是 |
| 血清白蛋白 | g/l的数值 |
| 中心性胆汁淤积 | 0 = 否;1 = 是 |
| 治疗方案 | 0 = 硫唑嘌呤;1 = 安慰剂 |

| Variable | Scoring |
|---|---|
| Serum bilirubin | log10(value in μmol/l) |
| Age | exp[(age in yrs - 20)/10] |
| Cirrhosis | 0 = No; 1 = Yes |
| Serum albumin | value in g/l |
| Central cholestasis | 0 = No; 1 = Yes |
| Therapy | 0 = Azathioprine; 1 = Placebo |

表13.7 Cox回归模型拟合于PBC试验中硫唑嘌呤与安慰剂的数据
Table 13.7 Cox regression model fitted to data from PBC trial of azathioprine versus placebo

| 变量 | 回归系数 ($b$) | 标准误 (SE($b$)) | $e^b$ |
|---|---|---|---|
| 血清胆红素 | 2.510 | 0.316 | 12.31 |
| 年龄 | 0.00690 | 0.00162 | 1.01 |
| 肝硬化 | 0.879 | 0.216 | 2.41 |
| 血清白蛋白 | -0.0504 | 0.0181 | 0.95 |
| 中心性胆汁淤积 | 0.679 | 0.275 | 1.97 |
| 治疗 | 0.520 | 0.207 | 1.68 |

| Variable | Regression coefficient ($b$) | SE($b$) | $e^b$ |
|---|---|---|---|
| Serum bilirubin | 2.510 | 0.316 | 12.31 |
| Age | 0.00690 | 0.00162 | 1.01 |
| Cirrhosis | 0.879 | 0.216 | 2.41 |
| Serum albumin | -0.0504 | 0.0181 | 0.95 |
| Central cholestasis | 0.679 | 0.275 | 1.97 |
| Therapy | 0.520 | 0.207 | 1.68 |
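The approximate significance test described here (the regression estimate divided by its standard error, referred to the standard Normal distribution) can be sketched as follows. The coefficients and standard errors are those of Table 13.7; the helper name `two_sided_p` is an illustrative assumption, and the two-sided P value is computed from the Normal tail area via the complementary error function.

```python
import math

# Regression coefficients (b) and standard errors (SE(b)) from Table 13.7.
table_13_7 = {
    "Serum bilirubin": (2.510, 0.316),
    "Age": (0.00690, 0.00162),
    "Cirrhosis": (0.879, 0.216),
    "Serum albumin": (-0.0504, 0.0181),
    "Central cholestasis": (0.679, 0.275),
    "Therapy": (0.520, 0.207),
}


def two_sided_p(z):
    """Two-sided P value for z under the standard Normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


results = {}
for name, (b, se) in table_13_7.items():
    z = b / se  # estimate divided by its standard error
    results[name] = (z, two_sided_p(z))
    print(f"{name:20s} z = {z:6.2f}  P = {results[name][1]:.4f}")
```

Every variable gives a two-sided P value below 0.05, in line with the statement that each was significant at the 5% level at least.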

这种表格中首先需要注意的是回归系数的符号。正号表示该变量值较高的受试者风险更高,因此预后更差。因此,从表13.7来看,较高的血清胆红素和年龄与较差的生存率相关,而较高的血清白蛋白值则有益。三个二元(0-1)变量显示无肝硬化(PBC中不一定存在)和无中心性胆汁淤积的受试者预后更好,且接受硫唑嘌呤治疗者预后优于安慰剂组。
The first feature to note in such a table is the sign of the regression coefficients. A positive sign means that the hazard is higher, and thus the prognosis worse, for subjects with higher values of that variable. Thus, from Table 13.7 higher serum bilirubin and age are associated with poorer survival, but higher values of serum albumin are beneficial. The three binary (0-1) variables show better prognosis for subjects without cirrhosis (not necessarily present in PBC) and without central cholestasis, and also for subjects treated with azathioprine rather than placebo.

单个回归系数的解释相当简单。对于协变量 $x$ 的两个不同取值 $x_A$ 和 $x_B$,其回归系数为 $b$ 时,两个值的估计风险之比为
An individual regression coefficient is interpreted quite easily. The ratio of the estimated hazards for two different values of a covariate $x$, say $x_A$ and $x_B$, with regression coefficient $b$, is given by

$$\frac{h_A(t)}{h_B(t)} = \exp[b(x_A - x_B)]$$

注意,由于模型中的假设,该结果不依赖于时间 $t$ 的选择。同时我们也不需要知道基线风险函数 $h_0(t)$ 的值。在二元变量(编码为0或1)的特殊情况下,风险比等于 $e^b$(见表13.7)。因此,安慰剂组的估计风险为硫唑嘌呤组的1.68倍(即168%)。等效地,硫唑嘌呤的作用是将风险降低到安慰剂组的 $e^{-0.520} = 0.59$(即59%)。然而,对生存概率的影响不能简单描述,因为它依赖于患者模型中其他变量的取值,如下所述。对于连续协变量,回归系数表示协变量值增加1时对数风险的增加。由于线性效应假设,这意味着白蛋白从30增加到31 g/l 的估计风险变化与从40增加到41 g/l 的变化相同,均为 $e^{-0.0504} = 0.95$,即降低5%。对于血清胆红素,$e^b$ 表示对数(以10为底)刻度增加1时的风险变化。因此,如果胆红素升高至10倍,估计风险升高至12.3倍。注意,估计的风险比 $e^b$ 类似于第13.3.2节中描述的风险比,不同之处在于此风险比已调整模型中其他变量的影响。
Note that because of the assumption in the model this result is not dependent upon the choice of time $t$. Notice too that we do not need to know the value of the baseline hazard function, $h_0(t)$. In the special case where we have a binary variable coded 0 or 1 the hazard ratio is equal to $e^b$ (see Table 13.7). Thus the estimated hazard with placebo is 1.68 (or 168%) of that with azathioprine. Equivalently, the effect of azathioprine is to reduce the hazard to $e^{-0.520} = 0.59$ (or 59%) of that with placebo. The effect on the survival probability, however, cannot be described simply as it depends on the patient's values of the other variables in the model, as described below. For continuous covariates the regression coefficient refers to the increase in log hazard for an increase of 1 in the value of the covariate. Because of the assumption of a linear effect this means that the estimated change in hazard of albumin increasing from 30 to 31 g/l is the same as a change from 40 to 41 g/l, and is equal to $e^{-0.0504} = 0.95$, i.e. a reduction of 5%. For serum bilirubin the value of $e^b$ corresponds to the change in hazard for an increase of 1 in the log (base 10) scale. Thus the estimated hazard increases 12.3 times if bilirubin is higher by a factor of 10. Notice that the estimated hazard ratio $e^b$ is analogous to that described in section 13.3.2. The difference is that this hazard ratio is adjusted for the effects of the other variables in the model.
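The hazard ratios quoted in this paragraph can be reproduced from the coefficients in Table 13.7. The sketch below assumes only that the hazard ratio for two values of a covariate is the exponential of the coefficient times their difference; the function name `hazard_ratio` is illustrative.

```python
import math

# Coefficients taken from Table 13.7.
B_THERAPY = 0.520     # 0 = azathioprine, 1 = placebo
B_ALBUMIN = -0.0504   # per g/l
B_BILIRUBIN = 2.510   # per unit of log10(bilirubin)


def hazard_ratio(b, x_new, x_ref):
    """Hazard at covariate value x_new relative to x_ref: exp(b * (x_new - x_ref))."""
    return math.exp(b * (x_new - x_ref))


hr_placebo = hazard_ratio(B_THERAPY, 1, 0)        # placebo relative to azathioprine
hr_azathioprine = hazard_ratio(B_THERAPY, 0, 1)   # azathioprine relative to placebo
hr_albumin_30_31 = hazard_ratio(B_ALBUMIN, 31, 30)
hr_albumin_40_41 = hazard_ratio(B_ALBUMIN, 41, 40)
# A tenfold rise in bilirubin is a rise of 1 on the log10 scale.
hr_bilirubin_x10 = hazard_ratio(B_BILIRUBIN, math.log10(100), math.log10(10))

print(f"placebo vs azathioprine: {hr_placebo:.2f}")
print(f"azathioprine vs placebo: {hr_azathioprine:.2f}")
print(f"albumin +1 g/l:          {hr_albumin_30_31:.2f}")
print(f"bilirubin x10:           {hr_bilirubin_x10:.1f}")
```

Because only the difference between the two covariate values enters the formula, a rise of 1 g/l in albumin gives the same ratio wherever it starts, which is the linear-effect assumption in action.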

与普通多元线性回归和逻辑回归(均在第12章讨论)类似,回归系数与变量值的组合可用作预后指数。风险函数括号内的部分给出预后指数(PI):
As with ordinary multiple linear regression and logistic regression (both discussed in Chapter 12), the combination of regression coefficients and values of variables can be used as a prognostic index. The part of the equation for the hazard function within brackets gives a prognostic index (PI) as

$$\mathrm{PI} = b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

任何时间点的风险和估计生存概率仅依赖于PI,而不依赖于单个变量的具体值。因为时间 $t$ 的生存概率为 $S(t) = \exp[-H(t)]$,我们有
The hazard and the estimated survival probability at any time depend only upon PI, not upon the values of the individual variables. Because the survival probability at time $t$ is $S(t) = \exp[-H(t)]$ we have

$$S(t) = \exp[-H_0(t) \times \exp(\mathrm{PI})]$$

累积基础风险函数 $H_0(t)$ 是随时间变化的阶梯函数,应由计算机程序输出给出。因此,我们也可以将 $S(t)$ 表示为阶梯函数。一些程序可能直接给出对应于 $H_0(t)$ 的生存函数,即 $S_0(t) = \exp[-H_0(t)]$。任意协变量组合的生存函数为
The cumulative underlying hazard function, $H_0(t)$, is a step function over time, and should be given in the output of the computer program. We can thus express $S(t)$ as a step function too. Some programs may instead give the survival function corresponding to $H_0(t)$, i.e. $S_0(t) = \exp[-H_0(t)]$. The survival function for any set of covariates is given by

$$S(t) = S_0(t)^{\exp(\mathrm{PI})}$$

图13.7展示了基于表13.7中模型且将其他变量设为均值的情况下,接受硫唑嘌呤和安慰剂治疗患者的估计生存曲线。通过固定最后一式中的 $t$(可选几个感兴趣的时间点),可以考察生存概率与预后指数(PI)之间的关系。图13.8展示了PBC试验中,2年、5年和8年生存概率随PI变化的估计曲线。对于新患者,可以轻松估计其在特定时间内的生存概率。不幸的是,计算该估计生存概率的置信区间较为困难。
Figure 13.7 shows estimated survival curves for patients given azathioprine and placebo, based on the model shown in Table 13.7 and setting all other variables to their mean values. The relation between survival probability and prognosis can be examined by fixing $t$ in the last equation, perhaps at a few values of interest. Figure 13.8 shows estimated 2, 5 and 8 year survival probability as a function of PI derived from the PBC trial. For any new patient it is easy to estimate the probability of surviving a given time. Unfortunately, it is difficult to calculate a confidence interval for the estimated survival probability.
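As a rough illustration of how a prognostic index turns into a survival estimate, the sketch below scores a patient as in Table 13.6, applies the coefficients of Table 13.7, and raises a baseline survival value to the power exp(PI). The patient's values and the baseline value `S0_T` are hypothetical assumptions chosen only to show the arithmetic; the trial's actual baseline survival function is not given in this chapter.

```python
import math

# Coefficients from Table 13.7, for variables scored as in Table 13.6.
COEFS = {
    "bilirubin": 2.510,    # log10(serum bilirubin in umol/l)
    "age": 0.00690,        # exp[(age in years - 20) / 10]
    "cirrhosis": 0.879,    # 0 = no, 1 = yes
    "albumin": -0.0504,    # g/l
    "cholestasis": 0.679,  # 0 = no, 1 = yes
    "therapy": 0.520,      # 0 = azathioprine, 1 = placebo
}


def prognostic_index(bilirubin, age, cirrhosis, albumin, cholestasis, placebo):
    """PI = sum of coefficient times scored covariate value."""
    scored = {
        "bilirubin": math.log10(bilirubin),
        "age": math.exp((age - 20) / 10),
        "cirrhosis": cirrhosis,
        "albumin": albumin,
        "cholestasis": cholestasis,
        "therapy": placebo,
    }
    return sum(COEFS[name] * scored[name] for name in COEFS)


def survival(s0_t, pi):
    """S(t) = S0(t) ** exp(PI), where S0(t) = exp(-H0(t))."""
    return s0_t ** math.exp(pi)


# Hypothetical patient: bilirubin 50 umol/l, age 55, albumin 35 g/l, no cirrhosis or cholestasis.
pi_aza = prognostic_index(50, 55, 0, 35, 0, placebo=0)
pi_pla = prognostic_index(50, 55, 0, 35, 0, placebo=1)

S0_T = 0.999  # made-up baseline survival at some time t, NOT a value from the trial
print(f"PI azathioprine = {pi_aza:.3f}, S(t) = {survival(S0_T, pi_aza):.3f}")
print(f"PI placebo      = {pi_pla:.3f}, S(t) = {survival(S0_T, pi_pla):.3f}")
```

The two prognostic indices differ by exactly the therapy coefficient, 0.520, while the difference in survival probability depends on the rest of the patient's values, as noted in section 13.6.1.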


图13.7 基于表13.7中Cox模型的硫唑嘌呤与安慰剂治疗患者的估计生存曲线(摘自Christensen等,1985)。
Figure 13.7 Estimated survival curves for patients treated with azathioprine or placebo based on the Cox model in Table 13.7 (from Christensen et al., 1985).


图13.8 PBC试验中,2年、5年和8年生存概率随预后指数(PI)变化的估计曲线。注意治疗方案已包含在PI中(摘自Christensen等,1985)。
Figure 13.8 Estimated 2, 5 and 8 year survival probability as a function of the prognostic index (PI) in the trial of azathioprine versus placebo. Note that the therapy given is incorporated in PI (from Christensen et al., 1985).

13.6.2 技术说明 13.6.2 Technical note

在普通多元回归(第12.4节)中,通过散点图可以轻松检验因变量与预测变量之间的线性关系。由于部分生存时间存在删失,我们无法采用相同方法,也无法用常规方式计算残差。本章不讨论Cox模型拟合优度的全面评估,但可简要说明预测变量(协变量)可能的变换。存在方法检验对风险函数影响的线性假设。表13.7中年龄和胆红素的变换即基于此类考虑。当线性假设存疑时,建议将有序变量分为三个或更多等大小组,变量可作为两个或多个虚拟变量纳入模型,或用组编码进行趋势检验。
With ordinary multiple regression (section 12.4) the assumption of a linear relation between the outcome and predictor variables is easily examined by scatter diagrams. Because of the censoring of some survival times we cannot use the same approach here, nor can we calculate residuals in the usual way. A general discussion of assessing the goodness-of-fit of the Cox model is beyond the scope of this chapter. However, some brief comments can be made regarding the possible transformation of predictor variables (covariates). There are ways to examine the linearity of effect on the hazard function. The transformations of age and bilirubin seen in Table 13.7 were based on such considerations. Where linearity of effect is in doubt it may be preferable to divide the ordered values into three or more equally sized groups. The variable can then be entered into the model as two or more dummy variables or the group codes can be used to test for trend.

此外,若变量分布高度偏斜,极端值会对模型选择产生过度影响,因此可考虑对变量取对数以减弱极端值影响。PBC试验中胆红素数据如图4.10所示,呈高度偏斜的对数正态分布。本研究中,胆红素数据的对数变换基于上述两点理由。
Also, if a variable has a highly skewed distribution the extreme values will exert an undue influence on the choice of model. We might therefore wish to take logarithms to reduce the effect of extreme values. The bilirubin data from the PBC trial were shown in Figure 4.10 to have a highly skewed Lognormal distribution. In this study log transformation was indicated for the bilirubin data on both counts.

13.6.3 评述 13.6.3 Comment

Elashoff(1983)和 Tibshirani(1982)对 Cox 回归进行了非技术性的讨论。Christensen(1987)提供了更详细但相对非数学化的解释,他还考虑了协变量值可能随时间变化的更复杂模型。对于生存数据的 Cox 回归分析,应寻求统计专家的指导。
Non-technical discussion of Cox regression is given by Elashoff (1983) and Tibshirani (1982). A more detailed but fairly non-mathematical explanation is given by Christensen (1987), who also considers the more complicated model in which the values of the covariates may themselves vary over time. Expert statistical advice should be sought for carrying out Cox regression on survival data.

13.7 生存研究的设计 13.7 DESIGN OF SURVIVAL STUDIES

当主要关注的结果是生存时间时,研究设计应包含一些特殊考虑。最重要的是要认识到,比较两个或多个组生存情况的检验效能,与总样本量无关,而是与感兴趣事件(如死亡)的发生次数相关。当感兴趣事件的风险较低时,可能需要非常大的研究规模。因此,提高研究效能的一种方法是选择更常见的事件作为研究终点,例如使用原发病复发或死亡,而不仅仅是死亡。(许多癌症患者研究报告分别分析了复发时间和死亡时间。)其他提高效能的方法包括增加总样本量和延长每个受试者的随访时间。例如,前述的 PBC 试验在患者招募期间持续了六年,随后又有六年的随访。因此,患者的潜在随访时间为6至12年。即便如此,仍需在多个国家招募患者才能获得足够的事件数。在最终分析的216名患者中,仅有105名死亡,这对于研究的效能而言并不多。
When the main outcome of interest is survival time, planning of a study should include some special considerations. It is most important to realize that the power of a test to compare survival in two or more groups is related not to the total sample size but to the number of events of interest such as deaths. When there is a small risk of the event of interest a vast study may be needed. One way to increase the power of a study is therefore to consider taking a more common event as the end-point of the study, such as using either recurrence of the original condition or death rather than death alone. (Many reports of studies of cancer patients give separate analyses relating to both time to recurrence and time to death.) Other ways to increase power are to increase the total sample size and to extend the length of follow-up of each subject. For example, the PBC trial just discussed had a six year period during which patients were recruited to the trial, and a further six years' follow-up. Thus patients were potentially followed for between 6 and 12 years. Even then it was only possible to get adequate numbers by recruiting patients in several countries. Of the 216 patients included in the final analysis only 105 had died, which is not a large number when the power of the study is considered.

由于上述各种影响,计算生存时间研究的合适样本量并不简单。Machin 和 Campbell(1987)提供了相关表格,Schoenfeld 和 Richter(1982)则提供了计算样本量的列线图。
Because of the various effects described, it is not simple to calculate the appropriate sample size for survival time studies. Machin and Campbell (1987) give tables and Schoenfeld and Richter (1982) give a nomogram for calculating sample size.
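To illustrate why power depends on the number of events rather than on the number of subjects, the sketch below uses a widely quoted approximation for the number of events needed by a two-sided logrank test with equal allocation, often attributed to Schoenfeld. This formula is not taken from this chapter and is offered only as a rough guide under the proportional hazards assumption; the function name and the example hazard ratios are illustrative assumptions.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate events needed: 4 * (z_alpha + z_beta)^2 / (ln HR)^2, for 1:1 allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2


events_large_effect = required_events(2.0)    # halving or doubling of the hazard
events_small_effect = required_events(1.25)   # a much more modest effect

print(f"hazard ratio 2.00: about {math.ceil(events_large_effect)} events")
print(f"hazard ratio 1.25: about {math.ceil(events_small_effect)} events")
```

The result is a number of events, not of patients: the total sample size then follows from the proportion of subjects expected to experience the event during follow-up, which is why a low event risk can demand a very large study.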

除这些考虑外,以生存时间为终点的研究设计与其他研究设计原则相同。第5章和第15章讨论了设计问题,Peto 等人(1976)的一篇论文值得一读,结合他们对这类研究分析的描述(Peto 等人,1977)。
Apart from these considerations the design of studies with survival time as the end-point is subject to the same considerations as other studies. Chapters 5 and 15 discuss design, and Peto et al. (1976) is a valuable paper to read in conjunction with their description of the analysis of such studies (Peto et al., 1977).

13.8 结果的呈现 13.8 PRESENTATION OF RESULTS

生存研究在结果呈现方面需特别注意。图形展示对生存数据尤为重要。以下建议是对其他章节中所述、适用于更一般情况(例如临床试验)的建议的补充。
Studies of survival require special consideration with respect to the presentation of the results. Graphical display is especially important for survival data. The suggestions below are in addition to those that may apply more generally, for example for clinical trials, described in other chapters.

13.8.1 数值呈现 13.8.1 Numerical presentation

应报告受试者随访时间的分布,通常给出范围即可。仅引用最长随访时间可能会产生误导。还应分别报告各类感兴趣事件(如死亡和疾病复发)的发生人数,若不同组别间存在差异,则应分别列出。
The distribution of the length of follow-up of subjects should be given; the range will probably suffice. It may mislead to quote only the maximum follow-up period. It is also useful to indicate the numbers of failures of each type of interest (e.g. deaths and recurrences of disease), separately for different groups of subjects if this is relevant.

logrank 检验的结果应给出观察到的失败次数和期望的失败次数,以及检验统计量和 $P$ 值。
The results of logrank tests should be given as the observed and expected numbers of failures as well as the test statistic and $P$ value.

13.8.2 图形展示 13.8.2 Graphical display

生存曲线图极其有价值。应基于 Kaplan-Meier 方法,或者对于按时间间隔分组的数据,可使用生命表方法。Kaplan-Meier 生存曲线应绘制为阶梯函数,如本章所示。使用不同的线型(例如实线、虚线)区分不同的受试者组别是有帮助的。对于小型研究,可以在生存曲线上用刻度标记删失观察的时间点。更一般地,定期显示仍处于风险中的人数(例如每月或每年)是有用的,这些数字可以显示在时间刻度下方或图表顶部。
Graphs of survival curves are enormously valuable. These should be based on the Kaplan-Meier method, or perhaps the life table method for data grouped by time interval. Kaplan-Meier survival curves should be drawn as step functions, as in this chapter. It is helpful to use different line types (e.g., solid, dashed) for different groups of subjects. For small studies it is possible to mark the times of censored observations by ticks on the survival curve. More generally, it is useful to show the numbers still at risk at regular intervals, for example every month or year, as appropriate. These can be shown beneath the time scale or along the top of the graph.

为避免对生存曲线右侧不可靠部分的误解,建议当仍处于风险中的人数较少时(例如五人)终止曲线。这还有助于放大包含重要信息的曲线左侧部分。
To avoid misinterpretation of the unreliable right-hand part of the survival curve it is advisable to terminate the curves when the number of subjects still at risk is small, say five. This also has the benefit of expanding the left-hand part of the curve which contains the important information.

练习 EXERCISES

13.1 鉴于第13.4.8节的评论,进行一次logrank检验,比较表13.2和13.3中的晕动病数据,将事件(失败)定义为呕吐或在120分钟前停止。将结果与第13.3.1节给出的结果进行比较。
13.1 In view of the comment in section 13.4.8, carry out a logrank test to compare the motion sickness data in Tables 13.2 and 13.3, taking an event (failure) as either vomiting or stopping before 120 minutes. Compare the results with those given in section 13.3.1.

13.2 练习11.1包括29名乳酸酸中毒患者的生存时间及一些可能的预后变量。
13.2 Exercise 11.1 included survival times of 29 patients with lactic acidosis, together with some possibly prognostic variables.

(a) 关于这些数据,绘制 Kaplan-Meier 生存曲线存在哪些问题?
(a) What problem is there with these data regarding a Kaplan-Meier plot of survival?
(b) 如何利用 logrank 检验评估这三个变量与生存时间的可能关系?进行相关检验。
(b) How could logrank tests be used to assess the possible relation between the three variables and survival time? Perform such tests.
(c) 比较使用这些变量进行 Cox 回归分析的结果:变量保持原样,或各自分成大致相等的三组。
(c) Compare the results of Cox regression analyses using these variables as they are or each divided into three roughly equal groups.

13.3 练习12.3展示了37名接受骨髓移植患者中各种因素与急性移植物抗宿主病(GvHD)发生的相关数据。使用诊断、受体年龄和性别、供体年龄和性别、供体是否曾怀孕、MECLR/MLR指数及GvHD作为预测生存的变量,进行向后逐步Cox回归分析,得到以下模型:
13.3 Exercise 12.3 showed data relating various factors to the occurrence of acute graft-versus-host disease (GvHD) in 37 patients having a bone marrow transplant. Backward stepwise Cox regression analysis using diagnosis, recipient's age and sex, donor's age and sex, whether the donor had been pregnant, MECLR/MLR index and GvHD to predict survival yields the following model:

| 变量 | 回归系数 | 标准误 |
|---|---|---|
| GvHD(0 = 否,1 = 是) | 2.306 | 0.5898 |
| CML(0 = 否,1 = 是) | -2.508 | 0.8095 |

| Variable | Regression coefficient | Standard error |
|---|---|---|
| GvHD (0 = No, 1 = Yes) | 2.306 | 0.5898 |
| CML (0 = No, 1 = Yes) | -2.508 | 0.8095 |

(a) 回归系数符号相反的解释是什么?
(a) What is the interpretation of the opposite signs for the regression coefficients?

(b) 计算以下患者相对于非GvHD非CML患者的死亡相对风险(风险比):
(b) Calculate the relative risks of dying (hazard ratio) for the following patients relative to non-GvHD non-CML patients:

(i) 有GvHD但无CML者,
(i) with GvHD but not CML,

(ii) 有CML但无GvHD者,
(ii) CML but without GvHD,

(iii) 同时有CML和GvHD者。
(iii) CML and GvHD.

(c) 计算与GvHD相关的风险比的95%置信区间。
(c) Calculate the 95% confidence interval for the hazard ratio associated with GvHD.

(d) 鉴于样本量(37)和死亡人数(18),评论该Cox回归模型的可靠性。
(d) Comment on the reliability of the Cox regression model in view of the sample size (37) and number of deaths (18).

14 医学研究中的一些常见问题 14 Some common problems in medical research

尽管统计学家无所不知,但医学界通常不认可他们诊断异常的能力,实际上他们自己也通常避免声称这一点。
Omniscient as statisticians are, their ability to diagnose abnormality is not generally acknowledged by the medical community, and indeed they usually refrain from claiming it.

Oldham (1979)
Oldham (1979)

一幅图胜过千次 $t$ 检验。
A picture may be worth a thousand $t$ tests.

Cooper 和 Zangwill (1989)
Cooper and Zangwill (1989)

14.1 引言 14.1 INTRODUCTION

第9至13章中描述的分析方法涵盖了医学研究中使用的大部分方法。虽然这些方法并非专门针对医学数据,尽管生存分析在医学研究中比其他领域更为常见。然而,有些类型的医学调查并不涵盖在这些方法之内。特别是流行病学研究需要许多其他领域较少使用的统计技术。已有许多专门介绍流行病学方法的书籍。
The methods of analysis described in Chapters 9 to 13 cover a high proportion of the methods used in medical research. None is specific to medical data, although survival analysis is much more common in medical research than in other fields. There are some types of medical investigation, however, that are not covered by these methods. Epidemiological studies in particular require many statistical techniques that are not used much in other fields. There are many books devoted to epidemiological methods.

本章涵盖了一些需要特殊方法处理的常见医学问题—方法比较研究、观察者一致性研究、诊断测试及参考范围的计算。这些方法的共同点是没有复杂的数学运算。它们的难点在于需要清晰理解分析目的以及结果的解释。此外,还考虑了包含对每个受试者进行一系列测量的数据分析,并推荐了一种简单的方法。最后,简要介绍了周期性变异的研究。
This chapter covers a small miscellany of common medical problems that need a special approach - method comparison studies, observer agreement studies, diagnostic tests and the calculation of reference ranges. These methods have in common the absence of any complicated mathematics. Their difficulties lie in requiring a clear understanding of the aim of the analysis, and in the interpretation of the results. Also considered is the analysis of data that comprise a series of measurements on each subject, for which a simple approach is also recommended. Lastly, there is a brief introduction to the investigation of cyclic variation.

14.2 方法比较研究 14.2 METHOD COMPARISON STUDIES

大多数临床测量并不精确。要么无法直接测量感兴趣的量,如心脏容积或肿瘤大小;要么测量虽然是直接的,但难以实施,比如手臂周长。此外,该变量可能随时间变化,如最大呼气流速或血压。
Most clinical measurements are not precise. Either it is not possible to measure directly the quantity of interest, such as heart volume or tumour size, or the measurement, although direct, is difficult to make, such as arm circumference. Further, the variable may change with time, such as peak expiratory flow rate or blood pressure.

由于这些不确定性,通常存在多种测量技术,比较两种(或多种)方法的研究很常见。这类研究的目的是查看方法之间是否“足够一致”,以便一种方法能够替代另一种,或者两种方法可以互换使用。例如,我们可能想知道一种新的廉价且/或快速的方法是否能得到与现有昂贵且缓慢方法相一致的结果。相同的考虑也适用于比较同一方法下两位观察者的研究。注意,我们需要明确“agreement”(一致性)的定义。此外,我们关注的是一致性的程度,因此这是一个估计问题,而非假设检验问题。
Because of these uncertainties there is usually a variety of techniques available and studies comparing two (or more) methods are common. The aim of these studies is usually to see if the methods 'agree' well enough for one method to replace the other, or perhaps for the two methods to be used interchangeably. For example, we may wish to see if a new cheap and/or quick method gives results that agree with those of an existing expensive, slow method. The same considerations apply to studies comparing two observers using one method. Note that we need to define what we mean by agreement. Also, we are concerned with the degree of agreement, so that this problem is one of estimation rather than hypothesis testing.

简而言之,这类数据的最佳处理方法是分析每个受试者两种方法测量值之间的差异。Bland 和 Altman(1986)对方法比较研究有更详尽的讨论。
Put simply, the best approach to this type of data is to analyse the differences between the measurements by the two methods on each subject. A fuller discussion of method comparison studies is given by Bland and Altman (1986).

14.2.1 分析 14.2.1 Analysis

表14.1显示了21名无主动脉瓣疾病患者的多普勒超声心动图测得的二尖瓣流量(MF)和横截面超声心动图测得的左心室搏出量(SV)。研究人员预期在这类患者中,两种测量值应相同,但在主动脉瓣关闭不全患者中会有所不同。因此,他们首先想了解MF和SV在无主动脉瓣疾病患者中的吻合程度。图14.1展示了数据的散点图。如果两种方法完全一致,所有点应落在等值线上,但实际数据从不完全一致。然而,我们可以看到所有数据点都相当接近等值线。更具信息量的另一种图形见图14.2。此图将两种方法的差值(SV-MF)对两者测量值的平均值作图。这种图有几个优点:我们更容易看出差异的大小及其围绕零的分布,并且可以直观检查差异是否与测量值大小相关。这里的平均值作为对未知真实值的最佳估计。第14.2.2节将介绍当差异的散布随均值增加而变宽时的处理方法。图14.2未显示此类问题,因此我们可以进一步分析差异。我们可以构建直方图,并计算均值和标准差,分别为 $-0.24$ cm³ 和 6.96 cm³。我们可以使用单样本 $t$ 检验来检验平均差异是否显著偏离零(或等价地,对原始数据使用配对 $t$ 检验),但更重要的是量化单个数据点的变异性。
Table 14.1 shows measurements of transmitral volumetric flow (MF) by Doppler echocardiography and left ventricular stroke volume (SV) by cross-sectional echocardiography in 21 patients without aortic valve disease. The researchers expected these measurements to be the same in such patients, but to differ in patients with aortic regurgitation. They thus first wished to see how well MF and SV agreed in patients without aortic valve disease. Figure 14.1 shows a scatter diagram of the data. If the methods agreed exactly the points would all lie on the line of equality, but of course real data never agree exactly. We can see, however, that all these data points are quite near to the line of equality. An alternative, more informative plot is shown in Figure 14.2. Here the differences between the methods (SV-MF) have been plotted against the average of the two measurements. There are several advantages of this plot. We can see the size of differences much more easily and also their distribution around zero, and we can check visually that the differences are not related to the size of the measurement. For this purpose the average acts as our best estimate of the unknown true value. Section 14.2.2 describes what we do when the scatter of the differences gets wider as the mean increases. Figure 14.2 shows no such problem, so we can investigate the differences further. We can construct a histogram, and can calculate the mean and standard deviation, which are $-0.24$ cm³ and 6.96 cm³. We could use a one sample $t$ test of the differences against zero (or, equivalently, a paired $t$ test on the original data) to see if the mean difference is significantly different from zero, but it is more important to quantify the variability of the individual data points.

表14.1 21名无主动脉瓣疾病患者的二尖瓣流量(MF)和左心室搏出量(SV)(Zhang 等,1986)。数据单位为 cm³,按MF值排序
Table 14.1 Transmitral volumetric flow (MF) and left ventricular stroke volume (SV) in 21 patients without aortic valve disease (Zhang et al., 1986). Data (in cm³) in order of MF values

| 患者 Patient | MF | SV |
|---|---|---|
| 1 | 47 | 43 |
| 2 | 66 | 70 |
| 3 | 68 | 72 |
| 4 | 69 | 81 |
| 5 | 70 | 60 |
| 6 | 70 | 67 |
| 7 | 73 | 72 |
| 8 | 75 | 72 |
| 9 | 79 | 92 |
| 10 | 81 | 76 |
| 11 | 85 | 85 |
| 12 | 87 | 82 |
| 13 | 87 | 90 |
| 14 | 87 | 96 |
| 15 | 90 | 82 |
| 16 | 100 | 100 |
| 17 | 104 | 94 |
| 18 | 105 | 98 |
| 19 | 112 | 108 |
| 20 | 120 | 131 |
| 21 | 132 | 131 |
| 均值 Mean | 86.0 | 85.8 |
| 标准差 SD | 20.3 | 21.2 |

这里的问题是测量方法的一致性,答案包含两个方面。首先,均值差异估计了一种方法相对于另一种方法的平均偏差。这里均值差异可忽略,说明两种方法平均而言高度一致。其次,必须考虑两种方法对个体的一致性程度,为此我们使用差异的标准差。虽然可以直接用差异的标准差作为一致性(或不一致性)的度量,但更有用的是利用标准差构建一个范围,预计该范围能覆盖大多数受试者方法间的一致性。
The question being asked relates to how well the methods agree, and there are two components to the answer. Firstly, the mean difference is an estimate of the average bias of one method relative to the other. Here the mean is negligible and we can say that the methods agree excellently on average. Secondly, it is essential to consider also how well the methods are likely to agree for an individual, for which purpose we use the standard deviation of the differences. Although we could simply quote the standard deviation of the differences as a measure of agreement (or disagreement), it is more useful to use the standard deviation to construct a range of values which we expect to cover the agreement between the methods for most subjects.


图14.1 二尖瓣容积流量(MF)与左心室搏出量(SV)。数据来源:Zhang 等(1986)。
Figure 14.1 Transmitral volumetric flow (MF) and left ventricular stroke volume (SV). Data from Zhang et al. (1986).


图14.2 二尖瓣容积流量与左心室搏出量之差(SV-MF)对平均值 $(\mathrm{SV} + \mathrm{MF})/2$ 的散点图。
Figure 14.2 Difference between transmitral volumetric flow and left ventricular stroke volume (SV-MF) plotted against average, $(\mathrm{SV} + \mathrm{MF})/2$.

我们在第3.4节看到,对于较为对称的分布,预期均值 $\pm$ 2SD 的范围包含约95%的观测值。因此,对于方法比较研究,可以将均值 $\pm$ 2SD 作为个体的95%一致性范围。该范围定义了95%一致性界限。对于当前数据,范围为 $-0.24 - (2 \times 6.96)$ 至 $-0.24 + (2 \times 6.96)$,即 $-14.2$ 至 13.7 cm³。换言之,对于新个体,预期两种方法测量值的差异小于 15 cm³,且差异方向同等可能。
We saw in section 3.4 that for reasonably symmetric distributions we expect the range mean $\pm$ 2SD to include about 95% of the observations. For a method comparison study we can therefore take mean $\pm$ 2SD as a 95% range of agreement for individuals. This range of values defines the 95% limits of agreement. For the present data we get a range from $-0.24 - (2 \times 6.96)$ to $-0.24 + (2 \times 6.96)$, which is $-14.2$ to 13.7 cm³. In other words, for a new subject we expect the two methods to give measurements that differ by less than 15 cm³, with any discrepancy being equally likely in either direction.
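The limits of agreement for the data in Table 14.1 can be computed in a few lines. The sketch below takes the differences as SV minus MF, as in Figure 14.2, and uses the sample standard deviation; the variable names are illustrative.

```python
from statistics import mean, stdev

# MF and SV for the 21 patients in Table 14.1 (in order of MF).
mf = [47, 66, 68, 69, 70, 70, 73, 75, 79, 81, 85,
      87, 87, 87, 90, 100, 104, 105, 112, 120, 132]
sv = [43, 70, 72, 81, 60, 67, 72, 72, 92, 76, 85,
      82, 90, 96, 82, 100, 94, 98, 108, 131, 131]

diffs = [s - m for s, m in zip(sv, mf)]            # SV - MF for each patient
averages = [(s + m) / 2 for s, m in zip(sv, mf)]   # horizontal axis of a difference-against-average plot

bias = mean(diffs)        # mean difference: average bias of one method relative to the other
sd_diff = stdev(diffs)    # sample SD of the differences (n - 1 denominator)
lower = bias - 2 * sd_diff
upper = bias + 2 * sd_diff

print(f"mean difference = {bias:.2f}, SD = {sd_diff:.2f}")
print(f"limits of agreement: {lower:.1f} to {upper:.1f}")
```

Plotting `diffs` against `averages` gives the type of display described for Figure 14.2, on which the mean difference and the two limits can be drawn as horizontal lines.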

研究者还比较了25例主动脉瓣病患者的MF和SV。图14.3展示了有无疾病患者两种方法差异的比较。25例主动脉瓣病患者中仅有2例的SV-MF落在无病患者的95%一致性界限内,支持了研究者的预期。
The researchers also compared MF and SV in 25 patients with aortic valve disease. Figure 14.3 compares the differences between the methods for patients with or without disease. For only two of the 25 patients with aortic valve disease was SV-MF within the 95% limits of agreement for patients without disease, supporting the researchers' expectations.

差异的均值和标准差的解释必须依赖临床情况—统计学无法定义可接受的一致性。
The interpretation of the mean and standard deviation of the differences must depend upon the clinical circumstances - it is not possible to use statistics to define acceptable agreement.


图14.3 有无主动脉瓣疾病患者SV与MF的差异,显示无病患者的95%一致性界限。
Figure 14.3 Differences between SV and MF for patients with or without aortic valve disease, showing 95% limits of agreement for patients without disease.

14.2.2 变量一致性(差异与均值的关系) 14.2.2 Variable agreement (relation between difference and mean)

有时,将两种方法的差值对平均值作图,会显示随着平均值的增加,散布范围变宽。换句话说,差值的标准差增加了。尽管上一节中给出的方法可能并非不合理,但通常通过对数据取对数后再计算一致性界限,能获得更好的分析结果。在这里,我们隐含地认为两种方法之间的差异大致是测量值大小的一个恒定比例。
Sometimes a plot of the differences between two methods against the average shows that there is a wider scatter as the average increases. In other words, the standard deviation of the differences increases. Although the approach given in the previous section may not be unreasonable, a better analysis is often obtained by taking logs of the data before calculating the limits of agreement. Here we are implicitly considering the differences between methods to be an approximately constant proportion of the size of the measurement.

与前几章描述的其他对数变换应用类似,我们对数据的对数进行常规分析,然后对结果进行反变换。一致性界限的反对数因此给出了方法之间比例一致性的范围。例如,我们可能得出结论,对于一个新受试者,方法A的测量值很可能在方法B测量值的80%到130%之间。Bland和Altman(1986)讨论了这种类型的分析,并给出了一个具体的示例。
As with other uses of the log transformation described in previous chapters, we perform the usual analysis on the logs of the data and then back-transform the results. Antilogs of the limits of agreement thus give us a range of proportional agreement between the methods. For example, we may conclude that for a new subject method A will be likely to give a value between 80% and 130% of that obtained by method B. Bland and Altman (1986) discuss this type of analysis, and give a worked example.
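The log-transformed analysis can be sketched as follows. The paired readings are hypothetical values invented for illustration (they are not from Bland and Altman's worked example); the point is only that the limits are calculated on the log scale and then back-transformed to give limits for the ratio of the two methods.

```python
import math
from statistics import mean, stdev

# Hypothetical paired measurements by methods A and B (illustrative values only).
method_a = [12.0, 25.0, 48.0, 95.0, 210.0, 400.0]
method_b = [10.0, 27.0, 45.0, 110.0, 190.0, 430.0]

# Differences of logs are logs of the ratios A/B.
log_diffs = [math.log(a) - math.log(b) for a, b in zip(method_a, method_b)]
log_bias = mean(log_diffs)
log_sd = stdev(log_diffs)

# Antilogs of the limits of agreement give a range for the ratio A/B.
ratio_lower = math.exp(log_bias - 2 * log_sd)
ratio_upper = math.exp(log_bias + 2 * log_sd)

print(f"method A is expected to lie between {100 * ratio_lower:.0f}% "
      f"and {100 * ratio_upper:.0f}% of method B")
```

Natural logs are used here, but any base gives the same back-transformed limits provided the antilog uses the same base.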

14.2.3 重复性 14.2.3 Repeatability

方法比较的一个重要方面是比较各方法的重复性。如果我们对同一受试者使用每种方法进行了两次(或更多次)测量,就可以评估使用相同技术进行的重复测量之间的相似性。对于配对观察,我们只需计算同一方法两次测量差值的标准差。然后可以比较这些标准差,以判断哪种方法的重复性更好。每个标准差也可以用来计算两个同一方法测量值差异预期落入的范围。Bland和Altman(1986)给出了一个具体示例。
An important aspect of method comparison is the comparison of the repeatability of each method. If we have two (or more) measurements of the same subjects by each method then we can assess the similarity of the duplicate measurements made using the same technique. For paired observations we simply calculate the standard deviation of the differences between the pairs of measurements using the same method. We can then compare the standard deviations to see which method is more repeatable. Each standard deviation can also be used to calculate limits within which we expect the differences between two measurements by the same method to lie. Bland and Altman (1986) give a worked example.

在方法比较研究中,重复测量很少进行,因此一个重要的可比性方面常被忽视。重复性差的方法永远无法与另一种方法达成良好一致。
Replicate measurements are rarely made in method comparison studies, so that an important aspect of comparability is often overlooked. A method with poor repeatability will never agree well with another method.

14.2.4 错误的分析 14.2.4 Erroneous analyses

方法比较研究常常被误分析。特别是,常常计算两种方法测得值之间的相关性,并将较高的 $r$ 值解释为良好一致性的标志。相关性不适合作为分析方法有几个原因。首先,相关系数是衡量两个变量之间线性关联强度的指标,这与衡量一致性不同。正如我们所见,一致性应以直接与测量相关的指标来评估。不能用解释一致性界限的方式来解释某个给定的 $r$ 值。其次,即使临床上一致性很差,相关性也可能很高。例如,在一项关于膝围测量变异性的研究中,Kirwan 等人(1979)发现,两位观察者在髌骨上方 15 cm 处的测量重复性非常差,测量结果临床价值有限。然而,两位观察者的读数相关性高达 0.99。高 $r$ 值可能是因为,如该研究中,受试者之间存在较大变异。显然,使用对受试者样本的选择高度敏感的统计方法来评估一致性是不合理的。对使用回归分析评估一致性也可以提出类似的批评。
Method comparison studies are frequently mis-analysed. In particular, the correlation between the values by the two methods is often calculated, with a high value of $r$ interpreted as an indication of good agreement. There are several reasons why correlation is an inappropriate analysis. Firstly, the correlation coefficient is a measure of the strength of linear association between two variables, which is not the same as a measure of agreement. As we have seen, agreement should be assessed in terms directly related to the measurements. It is not possible to interpret a given value of $r$ in the same way as the limits of agreement. Secondly, we may have a high degree of correlation when the agreement is clinically poor. For example, in a study of the variability of knee circumference measurements Kirwan et al. (1979) found that the repeatability of measurements made 15 cm above the patella by two observers was far too poor for the measurement to be clinically valuable. Nevertheless, there was a correlation of 0.99 between the observers' readings. A high value of $r$ can be obtained because, as in their study, there is large variation between subjects. It is clearly not reasonable to assess agreement by a statistical method that is highly sensitive to the choice of the sample of subjects. Similar criticisms can be levelled at the use of regression analysis for assessing agreement.

另一种常见的错误分析是通过假设检验比较均值,通常是配对 $t$ 检验。我们不能因为方法之间没有显著差异就推断它们一致性良好。实际上,差异的高度散布可能导致均值(偏倚)存在重要差异却不显著。采用这种方法时,较差的一致性反而降低了发现显著差异的可能性,从而增加了方法看似一致的概率!
Another common incorrect analysis is the comparison of means by a hypothesis test, often a paired $t$ test. We cannot deduce that methods agree well because they are not significantly different. Indeed a high scatter of differences may well lead to an important difference in means (bias) being non-significant. Using this approach worse agreement decreases the chance of finding a significant difference and so increases the chance that the methods will appear to agree!
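A small numerical sketch may help to show why a high correlation says little about agreement. The readings below are hypothetical, constructed so that subjects span a wide range while one observer reads consistently higher than the other; the correlation coefficient is computed directly from its definition.

```python
import math
from statistics import mean, stdev

# Hypothetical readings by two observers on seven subjects spanning a wide range.
observer_1 = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
observer_2 = [26.0, 33.0, 47.0, 54.0, 67.0, 73.0, 88.0]


def pearson_r(x, y):
    """Product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(observer_1, observer_2)
diffs = [b - a for a, b in zip(observer_1, observer_2)]
bias = mean(diffs)
sd_diff = stdev(diffs)

print(f"r = {r:.3f}")
print(f"mean difference = {bias:.1f}, "
      f"limits of agreement {bias - 2 * sd_diff:.1f} to {bias + 2 * sd_diff:.1f}")
```

Here the correlation is above 0.99 even though the second observer reads about 5 units higher on average, with individual discrepancies of up to 8 units. Whether that is acceptable is a clinical judgement, but the value of $r$ gives no hint of it.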

14.2.5 表示 14.2.5 Presentation

使用第14.2.1节的方法比较测量方法既简单又信息丰富。均值差异和一致性限度能够很好地总结数据。最好配合一两个图表,尤其是一张显示差异与均值关系的图表,其他数值可以叠加为三条水平线。原始数据的散点图,如图14.1,应为正方形,并显示等值线。
Comparing methods of measurement is very simple and informative using the approach of section 14.2.1. The mean difference and limits of agreement give an excellent summary of the data. It is useful to have one or two plots as well, especially one showing the difference against the mean, on which the other values can be superimposed as three horizontal lines. A plot of the raw data, such as in Figure 14.1, should be square and should show the line of equality.

14.2.6 讨论 14.2.6 Discussion

我们应当记住这种分析方法的局限性。由于通常不知道真实值,我们无法判断哪种方法更接近“真值”。对于未重复测量的研究,也无法比较不同测量方法的重复性。重要的是要认识到,如果某一方法不准确或重复性差(或两者皆有),那么与任何其他方法的比较都必然显示出较差的一致性,无论第二种方法多么优秀。因此,不能因为一致性差就断定两种方法都不好。相反,除非两种方法都准确且重复性好,否则很难出现良好的一致性。
We should remember the limitations of this type of analysis. We cannot tell which method is nearer to the 'truth' because we do not usually know the true values. Nor for unreplicated studies can we compare the repeatability of different methods of measurement. It is important to realize that if one method is either inaccurate or has poor repeatability (or both) comparison with any other method will inevitably show poor agreement, however good the second method is. Thus we should not infer from poor agreement that both methods are poor. In contrast, good agreement is most unlikely unless we have two methods that are both accurate and repeatable.

方法比较研究的设计需谨慎。样本量应足够大,以便准确估计一致性限度。我们可以计算一致性限度的置信区间,小样本时置信区间会较宽。因此,方法比较研究理想的样本量至少为50,最好更多。对每个受试者用每种方法测量两次非常有价值,这样可以比较两种方法的重复性。分析可以基于两次重复测量的平均值,但必须对差异的标准差进行修正以考虑这一点(Bland 和 Altman,1986)。比较的两种技术最好不要由不同观察者执行。观察者间的系统性差异(常见现象)会与方法间差异混淆。然而,当技术要求较高且需要丰富经验时,这种情况可能不可避免。
Care should be taken with the design of method comparison studies. The sample size should be large enough to allow the limits of agreement to be estimated well. We can calculate confidence intervals for the limits of agreement, and these will be wide in small samples. Thus a sample size of at least 50, but preferably rather larger, is desirable for a method comparison study. It is definitely valuable to take two measurements on each subject by each method, so that the repeatability of the two methods can be compared. The analysis can then be based on the average of the two replicates, but a correction must then be made to the standard deviation of the differences to allow for this fact (Bland and Altman, 1986). It is most undesirable for the two techniques being compared to be carried out by different observers. Any systematic variation between observers (a common phenomenon) will be inseparable from any difference between methods. This may be necessary, however, when the techniques involve considerable skill and experience.

如膝围例子所示,我们可以对观察者可比性研究使用相同的统计方法。然而,当比较的是类别评估而非连续测量时,不能使用此方法。第14.3节讨论了这类问题,这类问题通常出现在观察者比较中,而非方法比较。
As indicated by the knee circumference example, we can use the same statistical approach for studies of observer comparability. We cannot, though, use this method when comparing assessments in categories as opposed to measurements. Section 14.3 considers such problems, which usually arise in observer comparisons rather than method comparisons.

14.3 观察者间一致性 14.3 INTER-RATER AGREEMENT

类别评估间的一致性通常被视为比较不同观察者将受试者分类到多个组中能力的问题。下面介绍的方法同样适用于比较两种不同分类方案的研究,即类别数据的方法比较研究。我将分别举例说明。
Agreement between categorical assessments is usually considered as a problem of comparing the ability of different raters (observers) to classify subjects into one of several groups. The approach outlined below does, however, also apply to studies that compare two alternative categorization schemes, that is, a method comparison study for categorical data. I shall consider an example of each.

表14.2显示了两位放射科医师对85张干式乳腺X线照片的分类,类别包括“正常”、“良性疾病”、“癌症疑虑”或“癌症”。数据来自一项涉及九位放射科医师的大型研究(Boyd等,1982)。与上一节讨论的连续数据比较类似,我们需要某种一致性度量而非关联度量。因此,我们不使用 $\chi^2$ 检验,原因在于我们不想评估关联性,且这也不是假设检验问题。(此外,数据是成对的。)
Table 14.2 shows the classification by two radiologists of 85 xeromammograms as 'Normal', 'Benign disease', 'Suspicion of cancer' or 'Cancer'. The data come from a larger study of nine radiologists (Boyd et al., 1982). As with the comparison of continuous data discussed in the previous section, we require some measure of agreement rather than association. Thus we do not use the $\chi^2$ test, both because we do not wish to assess association and also because this is not a hypothesis testing problem. (Further, the data are paired.)

表14.2 两位放射科医师对85张干式乳腺X线照片的评估(Boyd等,1982)
Table 14.2 Assessments of 85 xeromammograms by two radiologists (Boyd et al., 1982)

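| 放射科医师 A | 放射科医师 B:正常 | 良性 | 疑似癌症 | 癌症 | 总计 |
|---|---|---|---|---|---|
| 正常 | 21 | 12 | 0 | 0 | 33 |
| 良性 | 4 | 17 | 1 | 0 | 22 |
| 疑似癌症 | 3 | 9 | 15 | 2 | 29 |
| 癌症 | 0 | 0 | 0 | 1 | 1 |
| 总计 | 28 | 38 | 16 | 3 | 85 |

| Radiologist A | Radiologist B: Normal | Benign | Suspected cancer | Cancer | Total |
|---|---|---|---|---|---|
| Normal | 21 | 12 | 0 | 0 | 33 |
| Benign | 4 | 17 | 1 | 0 | 22 |
| Suspected cancer | 3 | 9 | 15 | 2 | 29 |
| Cancer | 0 | 0 | 0 | 1 | 1 |
| Total | 28 | 38 | 16 | 3 | 85 |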

14.3.1 测量一致性 14.3.1 Measuring agreement

评估一致性的最简单方法是直接计算观察到的完全一致的次数,这里为 $21 + 17 + 15 + 1 = 54$。
The simplest approach to assessing agreement is to see how many exact agreements were observed, which here is $21 + 17 + 15 + 1 = 54$.

因此,有 $54/85 = 0.64$(64%)的片子达成了一致。这个简单计算有两个缺点。首先,它没有考虑一致性发生在表格的哪个位置;其次,即使放射科医师在猜测,我们也会期望他们之间存在一定的偶然一致。通过考虑超出偶然一致的部分,我们可以得到更合理的答案。
There is thus agreement for $54/85 = 0.64$ of the films. There are two weaknesses of this simple calculation. Firstly, it takes no account of where in the table the agreement was, and secondly, we would expect some agreement between the radiologists by chance even if they were guessing. We can get a more reasonable answer by considering the agreement in excess of the amount that we would expect by chance.

我们在第10.3节中看到,频数表中某个单元格的期望频数(在无关联的原假设下)是相关列总数与相关行总数的乘积除以总样本数。因此,表14.2中对角线上的期望频数为
We saw in section 10.3 that the expected frequency in a cell of a frequency table (under the null hypothesis of no association) is the product of the total of the relevant column and the total of the relevant row divided by the grand total. Thus the expected frequencies along the diagonal in Table 14.2 are

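$$\frac{33 \times 28}{85} = 10.87, \qquad \frac{22 \times 38}{85} = 9.84, \qquad \frac{29 \times 16}{85} = 5.46, \qquad \frac{1 \times 3}{85} = 0.04$$

总计 26.20
Total 26.20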

因此,仅由偶然产生的一致次数为26.2,占总数的比例为 $26.2/85 = 0.31$。问题是,放射科医师的实际一致性比0.31好多少。最大一致性为1.00,所以我们可以将放射科医师的一致性表示为超出偶然一致的最大可能范围的比例,即 $1.00 - 0.31$。然后计算一致性为
So the number of agreements expected just by chance is 26.2, which as a proportion of the total is $26.2/85 = 0.31$. The question, therefore, is how much better were the radiologists than 0.31. The maximum agreement is 1.00, so we can express the radiologists' agreement as a proportion of the possible scope for doing better than chance, which is $1.00 - 0.31$. We then calculate the agreement as

$$\frac{54/85 - 26.2/85}{1 - 26.2/85} = \frac{0.635 - 0.308}{1 - 0.308} = 0.47.$$

这种一致性度量称为kappa,记作 $\kappa$。当一致性完美时,kappa最大为1.00;值为零表示没有优于偶然的一致;负值表示一致性比偶然还差,这在本情境中不太可能出现。
The name for this measure of agreement is kappa, written $\kappa$. It has a maximum of 1.00 when agreement is perfect, a value of zero indicates no agreement better than chance, and negative values show worse than chance agreement, which is unlikely in this context.
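
The arithmetic above is easy to check by machine. The following is a minimal sketch in Python, not part of the original analysis: the function name `cohen_kappa` and the variable names are my own labels, and the table is typed in from Table 14.2 with radiologist A as rows and radiologist B as columns.

```python
# Table 14.2: rows = radiologist A, columns = radiologist B
# (Normal, Benign, Suspected cancer, Cancer).
table_14_2 = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]


def cohen_kappa(table):
    """Return (p_o, p_e, kappa) for a square table of frequencies."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Observed proportion of exact agreements (the diagonal of the table).
    p_o = sum(table[i][i] for i in range(len(table))) / n
    # Proportion of agreements expected by chance:
    # each diagonal expected frequency is row total * column total / n.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


p_o, p_e, kappa = cohen_kappa(table_14_2)
print(f"observed {p_o:.2f}, chance {p_e:.2f}, kappa {kappa:.2f}")
```

Run on Table 14.2, this gives the same three figures as the hand calculation: 0.64, 0.31 and 0.47.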

我们如何解释介于0和1之间的值,例如0.47?虽然无法给出绝对的定义,但以下指南(稍作改编自Landis和Koch,1977)应有所帮助:
How do we interpret values between 0 and 1, such as 0.47? While no absolute definitions are possible, the following guidelines (slightly adapted from Landis and Koch, 1977) should help:

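| $\kappa$ 的值 | 一致性强度 |
|---|---|
| < 0.20 | 差 |
| 0.21–0.40 | 一般 |
| 0.41–0.60 | 中等 |
| 0.61–0.80 | 良好 |
| 0.81–1.00 | 非常好 |

| Value of $\kappa$ | Strength of agreement |
|---|---|
| < 0.20 | Poor |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Good |
| 0.81–1.00 | Very good |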

因此,我们可以说放射科医师之间存在中等程度的一致性。有趣的是,这两位观察者表现出研究中任何观察者对之间最好的一致性。
We can thus say that there was moderate agreement between the radiologists. It is of some interest that these two observers showed the best agreement of any pair of observers in the study.

将数据简化为单个数字不可避免地会导致答案缺乏深刻意义,除非结合频数表进行分析。实际上,任何远低于0.5的 $\kappa$ 值都表明一致性较差,尽管可接受的一致性程度应视具体情况而定。频数表的检查是不可替代的,因为许多不同的表格可能产生相似的 $\kappa$ 值。
The reduction of the data to a single number inevitably yields an answer that is not terribly meaningful without examination of the table of frequencies. In practice, any value of $\kappa$ much below 0.5 will indicate poor agreement, although the degree of acceptable agreement must depend upon circumstances. There is no substitute for inspecting the table of frequencies, because many different tables will yield similar values of $\kappa$.

表14.3中的数据提供了分类评估替代方法比较的示例。该研究旨在比较放射性过敏吸附试验(RAST)和多重RAST(MAST)试验,用于检测不能进行皮肤点刺试验受试者血清中特异性IgE的过敏测试。MAST是一种新型、更简单且更经济的方法。
An example of the comparison of alternative methods of categorical assessment is given by the data in Table 14.3. The aim of the study was to compare a radioallergosorbent (RAST) test and a multi-RAST (MAST) test on sera for specific IgE as a test of allergy in subjects for whom prick tests cannot be used. The MAST was a new, simpler and cheaper method.

如表14.3所示,两种方法间存在相当大的不一致,几乎所有表格单元格中都有样本。表14.3的 $\kappa$ 值为0.32,证实了视觉印象。
As Table 14.3 shows, there was considerable disagreement between the methods, with some samples in nearly all the cells of the table. The value of $\kappa$ for Table 14.3 is 0.32, confirming the visual impression.

14.3.4节展示了计算 $\kappa$ 的数学表达式。
Section 14.3.4 shows the mathematical expression for calculating $\kappa$.

表14.3 RAST和MAST血清过敏测试方法比较(Brostoff等,1984)
Table 14.3 Comparison of RAST and MAST methods of testing serum for allergies (Brostoff et al., 1984)

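| MAST | RAST:阴性 (1) | 弱 (2) | 中等 (3) | 高 (4) | 非常高 (5) | 总计 |
|---|---|---|---|---|---|---|
| 阴性 (1) | 86 | 3 | 14 | 0 | 2 | 105 |
| 弱 (2) | 26 | 0 | 10 | 4 | 0 | 40 |
| 中等 (3) | 20 | 2 | 22 | 4 | 1 | 49 |
| 高 (4) | 11 | 1 | 37 | 16 | 14 | 79 |
| 非常高 (5) | 3 | 0 | 15 | 24 | 48 | 90 |
| 总计 | 146 | 6 | 98 | 48 | 65 | 363 |

| MAST | RAST: Negative (1) | Weak (2) | Moderate (3) | High (4) | Very high (5) | Total |
|---|---|---|---|---|---|---|
| Negative (1) | 86 | 3 | 14 | 0 | 2 | 105 |
| Weak (2) | 26 | 0 | 10 | 4 | 0 | 40 |
| Moderate (3) | 20 | 2 | 22 | 4 | 1 | 49 |
| High (4) | 11 | 1 | 37 | 16 | 14 | 79 |
| Very high (5) | 3 | 0 | 15 | 24 | 48 | 90 |
| Total | 146 | 6 | 98 | 48 | 65 | 363 |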

14.3.2 置信区间 14.3.2 Confidence interval

我们可以得到 $\kappa$ 的标准误差,从而计算置信区间。一般来说,这并不是特别有用,因为除非样本量很小,否则置信区间会很窄,因此对解释的变化空间有限。对于放射科医师的评估,我们得到 $\kappa = 0.47$,并可计算其标准误 $se(\kappa)$,因此 $\kappa$ 的95%置信区间为0.33到0.61。对于规模较大的MAST/RAST研究,$\kappa$ 为0.32,95%置信区间为0.26到0.38。计算方法见第14.3.4节。
We can obtain a standard error for $\kappa$, and thus a confidence interval. In general this is not all that useful, because unless the sample is small the confidence interval will be narrow and thus will not allow for much variation in interpretation. For the radiologists' assessments we had $\kappa = 0.47$ and can calculate its standard error $se(\kappa)$, so that a 95% confidence interval for $\kappa$ is given by 0.33 to 0.61. For the rather larger MAST/RAST study $\kappa$ was 0.32 with a 95% confidence interval from 0.26 to 0.38. The method of calculation is given in section 14.3.4.

14.3.3 加权卡帕系数 14.3.3 Weighted kappa

卡帕统计量的一个缺点是它不考虑分歧的程度—所有分歧都被同等对待。当类别是有序的(通常是这样)时,可能更合适根据分歧的大小赋予不同的权重。这里,接近对角线的观察值(仅相差一个类别)被认为不如相差两到三个类别的分歧严重。
A weakness of the kappa statistic is that it takes no account of the degree of disagreement - all disagreements are treated equally. Where the categories are ordered, as is often the case, it may be preferable to give different weights to disagreements according to the magnitude of the discrepancy. Here observations near to the diagonal, representing a difference of only one category, are considered less serious than those where the discrepancy is two or three categories.

我们可以将这一思想融入 $\kappa$ 的计算中,得到称为加权卡帕系数的量。对于MAST-RAST研究,加权卡帕系数为0.56,比未加权的0.32略好。同样,放射科医师评估的加权卡帕系数为0.57,而未加权的为0.47。加权卡帕系数通常高于未加权卡帕系数,因为分歧更可能仅相差一个类别,而非多个类别。
We can build this idea into the calculation of $\kappa$ to get a quantity called weighted kappa. For the MAST-RAST study weighted kappa is 0.56, somewhat better than the unweighted value of 0.32. Similarly, weighted kappa for the radiologists' assessments is 0.57, compared with the unweighted value of 0.47. Weighted kappa is usually higher than unweighted kappa because disagreements are more likely to be by only one category than by several categories.

14.3.4 卡帕系数的数学原理 14.3.4 Mathematics for kappa

(本节可省略,不影响连贯性。)
(This section can be omitted without loss of continuity.)

卡帕系数是根据方形频数表中对角线上的观察频数和期望频数计算的。如果在 $g$ 个类别中有 $n$ 个观察值,则观察到的比例一致性为
Kappa is calculated from the observed and expected frequencies on the diagonal of a square table of frequencies. If there are $n$ observations in $g$ categories, then the observed proportional agreement is

$$p_o = \frac{1}{n} \sum_{i=1}^{g} f_{ii}$$

其中 $f_{ii}$ 是类别 $i$ 的一致次数。随机一致的期望比例为
where $f_{ii}$ is the number of agreements for category $i$. The expected proportion of agreements by chance is given by

$$p_e = \frac{1}{n^2} \sum_{i=1}^{g} r_i c_i$$

其中 $r_i$ 和 $c_i$ 分别是第 $i$ 类的行和列总数。一致性指标 kappa 定义为
where $r_i$ and $c_i$ are the row and column totals for the $i$th category. The index of agreement, kappa, is given by

$$\kappa = \frac{p_o - p_e}{1 - p_e}.$$

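$\kappa$ 的近似标准误为
The approximate standard error of $\kappa$ is

$$se(\kappa) = \sqrt{\frac{p_o (1 - p_o)}{n (1 - p_e)^2}}$$

因此,$\kappa$ 的总体值的95%置信区间为
so that a 95% confidence interval for the population value of $\kappa$ is given by

$$\kappa - 1.96 \times se(\kappa) \quad \text{to} \quad \kappa + 1.96 \times se(\kappa).$$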

加权 kappa 是通过根据表中各单元格距离对角线(表示一致性)的远近给频数赋权重得到的。对于第 $i$ 行第 $j$ 列的单元格,其观察频数为 $f_{ij}$,权重计算为
Weighted kappa is obtained by giving weights to the frequencies in each cell of the table according to their distance from the diagonal that indicates agreement. For the cell in row $i$ and column $j$, with observed frequency $f_{ij}$, a weight is calculated as

$$w_{ij} = 1 - \frac{|i - j|}{g - 1}.$$

因此,对角线上单元格权重为 1,而相差一个类别的单元格权重为 $1 - 1/(g - 1)$。对于 MAST-RAST 数据,差异为 0、1、2、3 和 4 的权重分别为 1、0.75、0.5、0.25 和 0。
Thus we give cells on the diagonal a weight of 1, while those where the difference is by one category get a weight of $1 - 1/(g - 1)$. For the MAST-RAST data, weights for discrepancies of 0, 1, 2, 3 and 4 are thus 1, 0.75, 0.5, 0.25 and 0 respectively.

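加权观察一致性比例和期望一致性比例计算如下
The weighted observed and expected proportional agreement are obtained as

$$p_{o(w)} = \frac{1}{n} \sum_{i=1}^{g} \sum_{j=1}^{g} w_{ij} f_{ij}$$

and

$$p_{e(w)} = \frac{1}{n^2} \sum_{i=1}^{g} \sum_{j=1}^{g} w_{ij} r_i c_j$$

加权卡帕系数计算公式为
and weighted kappa is given by

$$\kappa_w = \frac{p_{o(w)} - p_{e(w)}}{1 - p_{e(w)}}.$$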

Fleiss(1981,第223页)展示了如何计算加权卡帕的标准误。
Fleiss (1981, p. 223) shows how to calculate the standard error of weighted kappa.
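
As an illustration of the weighted calculation, here is a minimal Python sketch. It is only a sketch under stated assumptions: the function name `weighted_kappa` is my own, it uses the linear weights $1 - |i - j|/(g - 1)$ described in this section, and the two tables are typed in from Tables 14.2 and 14.3. It does not attempt the standard error of weighted kappa.

```python
def weighted_kappa(table):
    """Weighted kappa for a square table, with linear weights 1 - |i - j| / (g - 1)."""
    g = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_o_w = 0.0  # weighted observed proportional agreement
    p_e_w = 0.0  # weighted proportional agreement expected by chance
    for i in range(g):
        for j in range(g):
            weight = 1 - abs(i - j) / (g - 1)
            p_o_w += weight * table[i][j] / n
            p_e_w += weight * row_totals[i] * col_totals[j] / n ** 2
    return (p_o_w - p_e_w) / (1 - p_e_w)


# Table 14.2: rows = radiologist A, columns = radiologist B.
radiologists = [
    [21, 12, 0, 0],
    [4, 17, 1, 0],
    [3, 9, 15, 2],
    [0, 0, 0, 1],
]

# Table 14.3: rows = MAST, columns = RAST.
mast_rast = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]

print(f"radiologists: {weighted_kappa(radiologists):.2f}")
print(f"MAST-RAST:    {weighted_kappa(mast_rast):.2f}")
```

Both weighted values come out higher than the unweighted values of 0.47 and 0.32, as expected when most disagreements are by a single category.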

14.3.5 讨论 14.3.5 Discussion

与其他用于分析小型方形频数表的方法类似,卡帕系数的使用和解释存在一定困难。最常被提及的问题是卡帕值依赖于各类别中受试者的比例(患病率)。这一点通过一个简单的人工示例可以清楚地看出,该示例只有两个类别。表14.4显示了两个表格,它们的比例一致性均为0.8,但两个类别(+和−)的比例不同,且 $\kappa$ 值差异显著。差异的原因在于预期的偶然频数差异很大,如表14.5所示。$\kappa$ 的这一性质导致在不同研究中比较 $\kappa$ 值时,如果类别患病率不同,则容易产生误导。对于更大的表格情况也是如此,但判断可比性更加复杂。
As with other methods of looking at small, square frequency tables, there are difficulties associated with the use and interpretation of kappa. The most often cited problem is that the value of kappa depends upon the proportion of subjects (prevalence) in each category. This can be seen most clearly using a simple artificial example, where we have only two categories. Table 14.4 shows two tables with the same proportional agreement of 0.8, but with different proportions in the two categories (+ and −) and with markedly different values of $\kappa$. The reason for the difference is that the chance expected frequencies are very different, as shown in Table 14.5. The consequence of this property of $\kappa$ is that it is misleading to compare values of $\kappa$ from different studies where the prevalences of the categories differ. For larger tables the same is true, but it is even more complicated to judge comparability.

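表14.4 两位观察者诊断结果比较,两个类别的患病率不同
Table 14.4 Comparison of two observers' diagnoses with different prevalences in the two categories

(a)

| 观察者 2 | 观察者 1:+ | − | 总计 |
|---|---|---|---|
| + | 70 | 10 | 80 |
| − | 10 | 10 | 20 |
| 总计 | 80 | 20 | 100 |

| Observer 2 | Observer 1: + | − | Total |
|---|---|---|---|
| + | 70 | 10 | 80 |
| − | 10 | 10 | 20 |
| Total | 80 | 20 | 100 |

(b)

| 观察者 2 | 观察者 1:+ | − | 总计 |
|---|---|---|---|
| + | 40 | 10 | 50 |
| − | 10 | 40 | 50 |
| 总计 | 50 | 50 | 100 |

| Observer 2 | Observer 1: + | − | Total |
|---|---|---|---|
| + | 40 | 10 | 50 |
| − | 10 | 40 | 50 |
| Total | 50 | 50 | 100 |

表14.5 与表14.4数据对应的期望频数
Table 14.5 Expected frequencies corresponding to the data in Table 14.4

(a)

| 观察者 2 | 观察者 1:+ | − | 总计 |
|---|---|---|---|
| + | 64 | 16 | 80 |
| − | 16 | 4 | 20 |
| 总计 | 80 | 20 | 100 |

| Observer 2 | Observer 1: + | − | Total |
|---|---|---|---|
| + | 64 | 16 | 80 |
| − | 16 | 4 | 20 |
| Total | 80 | 20 | 100 |

(b)

| 观察者 2 | 观察者 1:+ | − | 总计 |
|---|---|---|---|
| + | 25 | 25 | 50 |
| − | 25 | 25 | 50 |
| 总计 | 50 | 50 | 100 |

| Observer 2 | Observer 1: + | − | Total |
|---|---|---|---|
| + | 25 | 25 | 50 |
| − | 25 | 25 | 50 |
| Total | 50 | 50 | 100 |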

另一个问题是 $\kappa$ 依赖于类别数。表14.3中的数据可以合并为三个类别,而非五个类别:0,1或2,3或4。对于得到的 $3 \times 3$ 表,我们计算出 $\kappa = 0.42$,相比之下,完整的 $5 \times 5$ 表为 0.32。如果考虑这些方法实际上只用于将样本分类为阴性(0)或阳性(1、2、3或4),我们可以将数据折叠成一个 $2 \times 2$ 表,其 $\kappa = 0.53$,虽然不是很理想,但比 0.32 要好。
Another problem is that $\kappa$ depends on the number of categories. The data in Table 14.3 can be grouped into three rather than five categories: 0, 1 or 2, 3 or 4. For the resulting $3 \times 3$ table we find $\kappa = 0.42$, compared with 0.32 for the full $5 \times 5$ table. If we consider that the methods are really only going to be used to categorize samples as negative (0) or positive (1, 2, 3 or 4) we can collapse the data into a $2 \times 2$ table, for which $\kappa = 0.53$, not wonderful but better than 0.32.
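
Both difficulties, the dependence on prevalence and the dependence on the number of categories, can be reproduced with a few lines of code. The sketch below is illustrative only: the helper names `kappa` and `collapse` are my own, the tables are typed in from Tables 14.3 and 14.4, and the groupings assume that the five ordered categories of Table 14.3 are merged as described in the text (first category alone, then the next two, then the last two).

```python
def kappa(table):
    """Unweighted kappa for a square table of frequencies."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_o = sum(table[i][i] for i in range(len(table))) / n
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def collapse(table, groups):
    """Merge categories: each entry of `groups` lists the original
    row/column indices that form one new category."""
    return [
        [sum(table[i][j] for i in rows for j in cols) for cols in groups]
        for rows in groups
    ]


# Table 14.4: same proportional agreement (0.8), different prevalences.
table_14_4a = [[70, 10], [10, 10]]
table_14_4b = [[40, 10], [10, 40]]

# Table 14.3: rows = MAST, columns = RAST.
table_14_3 = [
    [86, 3, 14, 0, 2],
    [26, 0, 10, 4, 0],
    [20, 2, 22, 4, 1],
    [11, 1, 37, 16, 14],
    [3, 0, 15, 24, 48],
]
three_by_three = collapse(table_14_3, [[0], [1, 2], [3, 4]])
two_by_two = collapse(table_14_3, [[0], [1, 2, 3, 4]])

print("Table 14.4(a):", kappa(table_14_4a))
print("Table 14.4(b):", kappa(table_14_4b))
print("5 x 5:", kappa(table_14_3))
print("3 x 3:", kappa(three_by_three))
print("2 x 2:", kappa(two_by_two))
```

For Table 14.4 the two values are 0.375 and 0.60 despite identical proportional agreement, and for the MAST-RAST data kappa rises as categories are merged.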

尽管存在这些缺点,$\kappa$ 的使用在类似上述例子的资料中越来越普遍。它无疑是正确的分析方法。然而,错误的分析仍然很常见。MAST-RAST 数据曾通过计算相关系数进行分析(Brostoff 等,1984)。作者根据 $r$ 的值得出方法结果相似的结论,并推荐使用更简单且更便宜的 MAST 方法。Pearson 相关系数不仅不适用于序数数据,而且如第14.2.4节所述,它也不是判断一致性的合适方法。他们的结论与表14.3中的数据不符。同样,使用 $\chi^2$ 检验来判断一致性也是错误的,因为它同样是关联性检验。$\kappa$ 统计量可解释为机会校正后的比例一致性,是解决此类问题的最佳方法,但如果可能,必须展示原始数据。可接受的一致性取决于具体情况。没有任何 $\kappa$ 值能被普遍视为良好一致性的标准;统计学无法替代临床判断。
Despite these shortcomings, the use of kappa is becoming common for data like the examples discussed. It is undoubtedly the right type of approach. Incorrect analyses of such data are still common, however. The MAST-RAST data were analysed by calculating the correlation coefficient (Brostoff et al., 1984). The authors concluded from the value of $r$ that the methods gave similar results and recommended the use of the simpler and cheaper MAST method. Not only is Pearson's correlation coefficient unsuitable for ordinal data but, as we saw in section 14.2.4, it is an inappropriate approach to judge agreement. Nor is their conclusion compatible with the data shown in Table 14.3. Similarly, it would be incorrect to judge agreement by a $\chi^2$ test, which is also a test of association. The kappa statistic, which may be interpreted as the chance-corrected proportional agreement, is the best approach to this type of problem, but it is important to show the raw data if at all possible. Acceptable agreement depends upon the circumstances. There is no value of kappa that can be regarded universally as indicating good agreement: statistics cannot provide a simple substitute for clinical judgement.

14.4 诊断测试 14.4 DIAGNOSTIC TESTS

诊断是临床实践的重要组成部分,许多医学研究致力于改进诊断方法。这些研究的统计分析相对简单,但由于术语不熟悉且混乱,常常带来困难。
Diagnosis is an essential part of clinical practice, and much medical research is carried out to try to improve methods of diagnosis. The statistical analysis of these studies is fairly simple, but causes difficulty because of unfamiliar and confusing terminology.

最简单的情况是根据某项检查结果将患者分为两组,例如X光、活检,或某种症状或体征的有无。表14.6给出了一个例子,显示了肝脏扫描结果与基于尸检、活检或手术检查的诊断之间的关系。这里关注的问题是肝脏扫描在诊断异常病理方面的准确性。虽然我们可以使用第14.3节描述的方法简单计算两种分类的一致性,但该问题不同,因为两种分类之间的关系存在不对称性。我们希望描述扫描诊断患者真实状态的能力。实际上,我们很少知道真相,因此评估测试时是相对于诊断而言的。这个区别将在第14.4.7节进一步讨论。
The simplest case to consider is that where patients can be classified into two groups according to the results of an investigation, perhaps an X-ray or biopsy, or the presence or absence of a symptom or sign. An example is given in Table 14.6, which shows the relation between the results of liver scans and diagnosis based on either autopsy, biopsy or surgical inspection. The question of interest here is how good the liver scan is at diagnosing abnormal pathology. While we could simply calculate the agreement between the two classifications using the methods described in section 14.3, this problem is different because of the asymmetry of the relation between the two classifications. We wish to describe the ability of the scan to diagnose the true patient status. In practice we rarely know the truth, and so evaluate the test in relation to the diagnosis. This distinction is considered further in section 14.4.7.


表14.6 344例患者肝脏扫描结果与诊断的关系(Drum和Christacapoulos,1972)
Table 14.6 Relation between results of liver scan and diagnosis in 344 patients (Drum and Christacapoulos, 1972)

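| 肝脏扫描 | 病理情况:异常(+) | 正常(−) | 总计 |
|---|---|---|---|
| 异常(+) | 231 | 32 | 263 |
| 正常(−) | 27 | 54 | 81 |
| 总计 | 258 | 86 | 344 |

| Liver scan | Pathology: Abnormal (+) | Normal (−) | Total |
|---|---|---|---|
| Abnormal (+) | 231 | 32 | 263 |
| Normal (−) | 27 | 54 | 81 |
| Total | 258 | 86 | 344 |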

14.4.1 敏感性和特异性 14.4.1 Sensitivity and specificity

一种方法是计算病理正常和病理异常的患者中,被扫描正确“诊断”的比例。术语阳性和阴性指的是感兴趣状态的存在或不存在,这里是异常病理。因此共有258例阳性和86例阴性。这两组中基于扫描得到正确诊断的比例分别为 $231/258 = 0.90$ 和 $54/86 = 0.63$。这两个比例名称相似,容易混淆,其正式定义如下:
One approach is to calculate the proportions of patients with normal and abnormal pathology who are correctly 'diagnosed' by the scan. The terms positive and negative refer to the presence or absence of the condition of interest, here abnormal pathology. Thus there are 258 positives and 86 negatives. The proportions of these two groups that have correct diagnoses based on the scan are thus $231/258 = 0.90$ and $54/86 = 0.63$ respectively. These two proportions have confusingly similar names, which are formally defined as follows:

敏感性是指被测试正确识别的阳性比例;
Sensitivity is the proportion of positives that are correctly identified by the test;

特异性是指被测试正确识别的阴性比例。
Specificity is the proportion of negatives that are correctly identified by the test.

因此,我们可以说,基于所研究的样本,预计90%的病理异常患者会有异常(阳性)肝脏扫描,而63%的病理正常者会有正常(阴性)肝脏扫描。
We can thus say that, based on the sample studied, we would expect 90% of patients with abnormal pathology to have abnormal (positive) liver scans, while 63% of those with normal pathology would have normal (negative) liver scans.

乍一看,这些简单的计算似乎回答了所提出的问题,但这些问题远比表面复杂。我们只是从一个方向回答了问题。在临床实践中,已知的只有检测结果,因此我们想知道该检测预测异常的准确性。换句话说,有异常检测结果的患者中,真正异常的比例是多少?
At first sight these simple calculations appear to have answered the question posed, but there is more to these problems than meets the eye. We have answered the question from one direction only. In clinical practice the test result is all that is known, so we want to know how good the test is at predicting abnormality. In other words, what proportion of patients with abnormal test results are truly abnormal?

14.4.2 阳性预测值和阴性预测值 14.4.2 Positive and negative predictive values

诊断测试的全部意义在于用它来做出诊断,因此我们需要知道测试给出正确诊断(无论是阳性还是阴性)的概率。敏感性和特异性并不能提供这一信息。相反,我们必须从检测结果的角度来分析数据。
The whole point of a diagnostic test is to use it to make a diagnosis, so we need to know what the probability is of the test giving the correct diagnosis, whether it is positive or negative. The sensitivity and specificity do not give us this information. Instead we must approach the data from the direction of the test results.

在263名肝脏扫描异常的患者中,231名病理异常,正确诊断的比例为 $231/263 = 0.88$。同样,在81名肝脏扫描正常的患者中,正确诊断的比例为 $54/81 = 0.67$。这两个比例有更合适的名称,正式定义如下:
Of the 263 patients with abnormal liver scans 231 had abnormal pathology, giving the proportion of correct diagnoses as $231/263 = 0.88$. Similarly, among the 81 patients with normal liver scans the proportion of correct diagnoses was $54/81 = 0.67$. These two proportions are given more sensible names, which are formally defined as follows:

阳性预测值是指阳性检测结果患者中正确诊断的比例:
Positive predictive value is the proportion of patients with positive test results who are correctly diagnosed:

阴性预测值是指阴性检测结果患者中正确诊断的比例。
Negative predictive value is the proportion of patients with negative test results who are correctly diagnosed.
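
The four proportions defined so far can be read off Table 14.6 directly. The short Python sketch below is only an illustration; the variable names are my own, chosen to match the column (pathology) and row (scan) structure of the table.

```python
# Counts from Table 14.6 (liver scan result against pathology).
scan_pos_path_pos = 231  # abnormal scan, abnormal pathology
scan_pos_path_neg = 32   # abnormal scan, normal pathology
scan_neg_path_pos = 27   # normal scan, abnormal pathology
scan_neg_path_neg = 54   # normal scan, normal pathology

# Sensitivity and specificity work down the columns (known pathology).
sensitivity = scan_pos_path_pos / (scan_pos_path_pos + scan_neg_path_pos)
specificity = scan_neg_path_neg / (scan_pos_path_neg + scan_neg_path_neg)

# Predictive values work along the rows (known test result).
ppv = scan_pos_path_pos / (scan_pos_path_pos + scan_pos_path_neg)
npv = scan_neg_path_neg / (scan_neg_path_pos + scan_neg_path_neg)

print(f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
print(f"PPV {ppv:.2f}, NPV {npv:.2f}")
```

The column-based and row-based calculations use the same four counts but different denominators, which is why they answer different questions.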

阳性预测值和阴性预测值直接评估了检测在实际中的实用性。不幸的是,分析还不能停止,因为还有一个关键方面未被上述计算体现,那就是异常的患病率。
The positive and negative predictive values give a direct assessment of the usefulness of the test in practice. Unfortunately, we still cannot stop the analysis because there is another essential aspect of the analysis to consider, which is invisible in the above calculations, and that is the prevalence of abnormality.

14.4.3 患病率的影响 14.4.3 The effect of prevalence

敏感性和特异性的缺点是它们不能以临床实用的方式评估检测的准确性。然而,它们的优点是不会受到异常患者比例(即患病率)的影响。这里假设我们知道患者的真实状态。有关此点的进一步讨论见第14.4.7节。
The disadvantage of the sensitivity and specificity is that they do not assess the accuracy of the test in a clinically useful way. They do have the advantage, however, that they are not affected by the proportion of subjects with the abnormality, which we call the prevalence. It is assumed here that we know the patients' true status. See section 14.4.7 for further comment on this point.

预测值则相反,临床上更有用,但非常依赖患病率。在肝脏扫描研究中,异常的患病率非常高,为 $258/344 = 0.75$,即正好是四分之三。在不同的临床环境中,异常的患病率会有很大差异。利用表14.6中的数据,我构建了表14.7,展示了在异常患病率为0.25的患者群体中预期的结果。表14.8则显示了这两种患病率数据的分析结果。
The predictive values, in contrast, are clinically useful but depend very strongly on the prevalence. In the liver scan study the prevalence of abnormality was very high, being $258/344 = 0.75$; that is, exactly three-quarters. In different clinical settings the prevalence of abnormality will vary greatly. Using the data in Table 14.6 I constructed Table 14.7 to show the results we would expect in a group of patients where the prevalence of abnormality is 0.25. Table 14.8 shows the analyses of the data for these two prevalences.

表14.7 基于表14.6数据,在异常患病率为0.25时肝脏扫描结果的预测影响
Table 14.7 Predicted effect on liver scan results of a prevalence of abnormality of 0.25, based on data in Table 14.6

| 肝脏扫描 | 病理状态:异常(+) | 正常(−) | 总计 |
|---|---|---|---|
| 异常(+) | 77 | 96 | 173 |
| 正常(−) | 9 | 162 | 171 |
| 总计 | 86 | 258 | 344 |

| Liver scan | Pathology: Abnormal (+) | Normal (−) | Total |
|---|---|---|---|
| Abnormal (+) | 77 | 96 | 173 |
| Normal (−) | 9 | 162 | 171 |
| Total | 86 | 258 | 344 |

表14.8 肝脏扫描数据在异常患病率为0.75和0.25时的分析
Table 14.8 Analysis of liver scan data with prevalences of abnormality of 0.75 and 0.25

| | 患病率 0.75 | 患病率 0.25 |
|---|---|---|
| 敏感性 | 0.90 | 0.90 |
| 特异性 | 0.63 | 0.63 |
| 阳性预测值 | 0.88 | 0.45 |
| 阴性预测值 | 0.67 | 0.95 |
| 总正确预测率 | 0.83 | 0.69 |

| | Prevalence 0.75 | Prevalence 0.25 |
|---|---|---|
| Sensitivity | 0.90 | 0.90 |
| Specificity | 0.63 | 0.63 |
| Positive predictive value | 0.88 | 0.45 |
| Negative predictive value | 0.67 | 0.95 |
| Total correct predictions | 0.83 | 0.69 |

如前所述,敏感性和特异性保持不变:这些计算基于表格的列,且不受各列患者比例的影响。相反,测试的预测值基于行计算,且因异常患病率的不同而发生显著变化。表14.6和表14.7中的数据差异在图14.4中得到了直观展示。
As noted, the sensitivity and specificity are unchanged: these calculations are made on the columns of the table, and are not affected by the proportion of patients in each column. In contrast the predictive values of the test are based on the rows, and have changed a lot because they are affected by the prevalence of abnormality. The contrast between the data in Tables 14.6 and 14.7 is illustrated in Figure 14.4.

患病率降低的影响符合预期:真实异常越少见,阴性测试结果越能确定无异常,而阳性结果确实表示患者异常的把握则越小。因此,测试的预测值依赖于被检测患者中异常的患病率,而这一点可能未知。我们不应将样本中观察到的预测值视为普适适用。
The effect of a lower prevalence is much as we would expect: the more uncommon true abnormality is, the more sure we can be that a negative test indicates no abnormality, and the less sure that a positive result really indicates an abnormal patient. The predictive values of a test thus depend upon the prevalence of the abnormality in the patients being tested, which may not be known. We should not take the predictive values observed in the sample as applying universally.

图14.4 图示(a)表14.6和(b)表14.7。P表示病理状态,T表示测试。敏感性由 P+ 区域中标记为 T+ 的比例表示,两图相同。同样,特异性由 P− 区域中标记为 T− 的比例表示,也相同。相反,阳性预测值是标记为 T+ 的区域中 P+ 的比例,两图差异显著。阴性预测值亦同理。
Figure 14.4 Graphical illustration of (a) Table 14.6 and (b) Table 14.7. P indicates the pathology and T indicates the test. The sensitivity is depicted by the proportion of the P+ area that is labelled T+, and is the same in both figures. Likewise the specificity is the proportion of the P− area that is labelled T−, and this is the same in both figures. Conversely, the PPV is the proportion of the area labelled T+ that is P+, and is markedly different for the two figures. The same applies to the NPV.

14.4.4 基于连续测量的诊断 14.4.4 Diagnosis based on a continuous measurement

到目前为止,我考虑的是根据某种症状或检测结果的有无来判断某种异常存在与否的情况。另一种常见情况是使用连续测量值进行诊断。我这里排除诸如高血压、贫血以及可能的肥胖等由连续测量值定义的疾病。我们可能只有单次测量值,或是由两个或多个不同测量值组合得出的评分。在这种情况下,基于逻辑回归的判别分析(第12.5.2节)与诊断测试方法之间的界限变得模糊,诊断与预后之间的界限也同样如此。
So far I have considered the case where we wish to determine the presence or absence of some abnormality on the basis of the presence or absence of some symptom or test result. Another common situation arises when the diagnosis is to be made using a continuous measurement. I exclude here conditions such as hypertension, anaemia and perhaps obesity, which are defined by the value of a continuous measurement. We may have a single measurement or a score derived from combining two or more different measurements. Here the distinction between discriminant analysis based on logistic regression (section 12.5.2) and the methodology of diagnostic tests becomes decidedly blurred, as does that between diagnosis and prognosis.

表14.9显示了艾滋病患者和健康献血者中HTLV-III(现称HIV)抗体检测的结果。如果我们希望用该检测诊断HIV血清阳性,则需要选择一个合适的截断值。对于每一个可能的截断值,我们都可以计算该检测的敏感性和特异性,也可以针对任意血清阳性率计算阳性预测值和阴性预测值。后一种计算方法详见第14.4.5节。
Table 14.9 shows results of an HTLV-III (now HIV) antibody assay among patients with AIDS and healthy blood donors. If we wish to use the test to diagnose HIV seropositivity then we need to choose an appropriate cut-off. For each possible cut-off we can calculate the sensitivity and specificity of the test, and we can also calculate the positive and negative predictive values for any prevalence of seropositivity. The method for this last calculation is given in section 14.4.5.

表14.10展示了HTLV-III抗体检测结果的这些计算。预测值的计算假设艾滋病的患病率分别为10%和1%,以说明患病率对预测值的影响。没有理由使用研究数据中的患病率(23%),因为两个样本组是独立选择的,这个比例没有实际意义。应使用的患病率取决于所研究人群的特征。
Table 14.10 shows these calculations for the HTLV-III antibody assay results. Predictive values have been calculated assuming the prevalence of AIDS to be either 10% or 1% to illustrate the effect of the prevalence on predictive values. There is no reason to use the prevalence in the study data (23%), which has no meaning because the two samples of subjects were selected independently. The appropriate figure to use will depend upon the characteristics of the population being studied.

表14.9 艾滋病患者和健康献血者中HTLV-III酶联免疫吸附测定(ELISA)结果(Weiss等,1985)。结果以测试样本对的平均吸光度与八个阴性对照孔平均吸光度的比值表示。
Table 14.9 Results of enzyme-linked immunosorbent assay (ELISA) for HTLV-III among patients with AIDS and healthy blood donors (Weiss et al., 1985). (Results expressed as the ratio of the mean absorbance of a pair of test samples divided by the mean absorbance of eight negative control wells)

| 比值 | 健康献血者 | 艾滋病患者 |
|---|---|---|
| < 2.0 | 202 (68%) | 0 (0%) |
| 2.0–2.99 | 73 (25%) | 2 (2%) |
| 3.0–3.99 | 15 (5%) | 7 (8%) |
| 4.0–4.99 | 3 (1%) | 7 (8%) |
| 5.0–5.99 | 2 (1%) | 15 (17%) |
| 6.0–11.99 | 2 (1%) | 36 (41%) |
| 12.0 以上 | 0 (0%) | 21 (24%) |
| 总计 | 297 (100%) | 88 (100%) |

| Ratio | Healthy blood donors | Patients with AIDS |
|---|---|---|
| < 2.0 | 202 (68%) | 0 (0%) |
| 2.0–2.99 | 73 (25%) | 2 (2%) |
| 3.0–3.99 | 15 (5%) | 7 (8%) |
| 4.0–4.99 | 3 (1%) | 7 (8%) |
| 5.0–5.99 | 2 (1%) | 15 (17%) |
| 6.0–11.99 | 2 (1%) | 36 (41%) |
| 12.0 + | 0 (0%) | 21 (24%) |
| Total | 297 (100%) | 88 (100%) |

表14.10 对表14.9数据计算的敏感性、特异性、阳性预测值(PPV)和阴性预测值(NPV)
Table 14.10 Calculations of sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for data in Table 14.9

| 比值截断点 | 敏感性 | 特异性 | PPV(血清阳性率10%) | NPV(血清阳性率10%) | PPV(血清阳性率1%) | NPV(血清阳性率1%) |
|---|---|---|---|---|---|---|
| 2.0 | 1.00 | 0.68 | 0.26 | 1.00 | 0.03 | 1.00 |
| 3.0 | 0.98 | 0.93 | 0.59 | 0.997 | 0.12 | 0.9997 |
| 4.0 | 0.90 | 0.98 | 0.81 | 0.99 | 0.28 | 0.999 |
| 5.0 | 0.82 | 0.99 | 0.87 | 0.98 | 0.38 | 0.998 |
| 6.0 | 0.65 | 0.99 | 0.91 | 0.96 | 0.49 | 0.996 |
| 12.0 | 0.24 | 1.00 | 1.00 | 0.92 | 1.00 | 0.992 |

| Cut-off for ratio | Sensitivity | Specificity | PPV (prevalence of HIV seropositivity 10%) | NPV (prevalence 10%) | PPV (prevalence 1%) | NPV (prevalence 1%) |
|---|---|---|---|---|---|---|
| 2.0 | 1.00 | 0.68 | 0.26 | 1.00 | 0.03 | 1.00 |
| 3.0 | 0.98 | 0.93 | 0.59 | 0.997 | 0.12 | 0.9997 |
| 4.0 | 0.90 | 0.98 | 0.81 | 0.99 | 0.28 | 0.999 |
| 5.0 | 0.82 | 0.99 | 0.87 | 0.98 | 0.38 | 0.998 |
| 6.0 | 0.65 | 0.99 | 0.91 | 0.96 | 0.49 | 0.996 |
| 12.0 | 0.24 | 1.00 | 1.00 | 0.92 | 1.00 | 0.992 |

截断点的选择不是统计学决策。假设表14.10中的数值表明该检测在临床上有用,那么“最佳”截断点应根据假阳性和假阴性结果所带来的相对代价(不一定是经济上的)来选择。这又与阳性检测后将采取的临床措施有关,特别是该检测是筛查测试还是诊断测试(见第14.4.7节)。然而,如下文所示,并不总是必须设定截断点。是否需要设定截断点取决于目的是诊断还是预后,这同样不是统计学问题。
The choice of a cut-off is not a statistical decision. Assuming that it is felt that the values in Table 14.10 show that the test is clinically useful, then the 'best' cut-off must be chosen according to the relative costs (not necessarily financial) associated with false positive and false negative test results. This in turn will be related to the clinical action that will follow a positive test, in particular whether the test is a screening test or a diagnostic test (see section 14.4.7). It is not always necessary, however, to impose a cut-off, as we will see below. The need to do so depends on whether the aim is to make a diagnosis or a prognosis. Again, this is not a statistical issue.

我们可以通过多元回归分析的结果达到类似的情况。正如我们在第12.4.8节中看到的,回归模型可以用来导出一个连续的评分或预后指数。当结果变量是二元的且使用逻辑回归时,该预后指数可以转换为该结果存在(或不存在)的概率。第12.5.2节中我描述了逻辑回归在判别问题中的应用。使用相同模型进行诊断是一个小的跳跃;事实上,这两个概念可以说是相同的。
We can arrive at a similar situation with the results of a multiple regression analysis. As we saw in section 12.4.8 a regression model can be used to derive a continuous score or prognostic index. When the outcome variable is binary and logistic regression is used, that prognostic index can be converted into a probability of the presence (or absence) of that outcome. In section 12.5.2 I described the application of logistic regression to the problem of discrimination. It is a small jump to the use of the same model for diagnosis; indeed, the two concepts are arguably the same.

下一节将更详细地审视这些计算。
In the next section the calculations are examined more closely.

14.4.5 计算 14.4.5 Calculations

表14.11展示了基于二元指标(如某特定症状或测试结果的有无)的任何诊断测试的一般表示。
Table 14.11 shows a general representation of any diagnostic test based on a binary indicator, such as the presence or absence of a particular symptom or test result.

表14.11 诊断测试的一般表示
Table 14.11 General representation of a diagnostic test

| 测试 | 疾病状态:阳性 | 阴性 | 总计 |
|---|---|---|---|
| 阳性 | $a$ | $b$ | $a + b$ |
| 阴性 | $c$ | $d$ | $c + d$ |
| 总计 | $a + c$ | $b + d$ | $n$ |

| Test | Disease status: Positive | Negative | Total |
|---|---|---|---|
| Positive | $a$ | $b$ | $a + b$ |
| Negative | $c$ | $d$ | $c + d$ |
| Total | $a + c$ | $b + d$ | $n$ |

我们可以给这四个单元格命名:
We can give names to the four cells:

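| 测试 | 疾病状态 | 名称 |
|---|---|---|
| + | + | 真阳性 ($a$) |
| + | − | 假阳性 ($b$) |
| − | + | 假阴性 ($c$) |
| − | − | 真阴性 ($d$) |

| Test | Disease status | Name |
|---|---|---|
| + | + | True positive ($a$) |
| + | − | False positive ($b$) |
| − | + | False negative ($c$) |
| − | − | True negative ($d$) |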
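之前定义和讨论的量是
The quantities defined and discussed earlier are

敏感性 $= a/(a + c)$
Sensitivity $= a/(a + c)$

特异性 $= d/(b + d)$
Specificity $= d/(b + d)$

阳性预测值 $= a/(a + b)$
Positive predictive value $= a/(a + b)$

阴性预测值 $= d/(c + d)$
Negative predictive value $= d/(c + d)$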

“假阳性率”和“假阴性率”这两个术语有时会被使用,但这些名称存在歧义。例如,假阴性率可能是 $c/(a + c)$,也可能是 $c/(c + d)$,这取决于你的视角。
The terms false positive rate and false negative rate are sometimes used, but these names are ambiguous. For example, the false negative rate might be $c/(a + c)$ or $c/(c + d)$, depending on your point of view.

研究中观察到的疾病患病率是 $(a + c)/n$。如果研究是在一个可定义的患者群体中进行的,比如某个特定门诊的患者,那么该患病率可能有用,基于该患病率计算的阳性预测值和阴性预测值也可能有用。然而,更一般地说,我们可能希望考虑该检测对其他患病率群体的预测能力,比如不同年龄组甚至普通人群。这些计算依赖于贝叶斯定理,即
The observed prevalence of disease in the study is $(a + c)/n$. If the study is carried out on a definable group of patients, such as those attending a particular clinic, then the prevalence may be useful, as may the calculation of positive and negative predictive values based on that prevalence. More generally, however, we may wish to consider the predictive ability of the test for groups with other prevalences of disease, such as different age groups or even the general population. These calculations depend upon Bayes' theorem, which is that

$$
\begin{aligned}
\text{Prob}(\text{disease} \mid \text{test positive})
&= \frac{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease})}{\text{Prob}(\text{test positive})} \\
&= \frac{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease})}{\text{Prob}(\text{test positive} \mid \text{disease}) \times \text{Prob}(\text{disease}) + \text{Prob}(\text{test positive} \mid \text{no disease}) \times \text{Prob}(\text{no disease})}
\end{aligned}
$$

其中 Prob(disease|test positive) 表示检测呈阳性时患病的概率,依此类推。根据之前的定义,可以明确的是
where Prob(disease|test positive) means the probability of disease when the test is positive, and so on. From the earlier definitions it is clear that

Prob(disease) = 疾病患病率
Prob(disease) = prevalence of disease

Prob(disease|test positive) = 阳性预测值(PPV)
Prob(disease|test positive) = positive predictive value (PPV)

Prob(test positive|disease) = 敏感性
Prob(test positive|disease) = sensitivity

Prob(test positive|no disease) = 1 − 特异性
Prob(test positive|no disease) = 1 − specificity
so that we can rewrite the above equation for the probability of disease when the test is positive as

$$\text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}$$

By a similar argument we can show that the negative predictive value (NPV) is

$$\text{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{(1 - \text{sensitivity}) \times \text{prevalence} + \text{specificity} \times (1 - \text{prevalence})}$$
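The effect of prevalence on the predictive values can be explored with a short calculation. The sketch below applies Bayes' theorem in the form just described; the sensitivity and specificity are hypothetical values chosen for illustration, and the function name is an assumption rather than anything defined in the text.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, using Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (false_neg + true_neg)
    return ppv, npv


# Hypothetical test characteristics, for illustration only.
SENSITIVITY, SPECIFICITY = 0.90, 0.85

for prevalence in (0.01, 0.10, 0.25, 0.50, 0.75):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With the test characteristics held fixed, the PPV rises and the NPV falls as the prevalence increases, which is why a prevalence must be specified before either can be quoted.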

Two consequences of these formulae are clear. Firstly, it is simple to estimate the predictive values for any prevalence of disease. The effect of varying the prevalence can be marked, as is seen in Table 14.10. Secondly, if we have no idea of the prevalence we cannot estimate the predictive value of the test. Another way of interpreting the prevalence is as the probability, before the test is carried out, that the subject has the disease, known as the prior probability of disease. The values of PPV and $1 - \text{NPV}$ are the revised estimates of the same probability for those subjects who are positive and negative to the test respectively, and are known as posterior probabilities. The difference between the prior and posterior probabilities is one way of assessing the usefulness of the test.

We can extend these ideas to diagnosis based on a continuous measurement, by considering each possible cut-off in turn. Table 14.10 illustrated the procedure for the association between assay results and HIV seropositivity.

The sensitivity and specificity are proportions, and so we can calculate confidence intervals for them using the methods of section 10.2.1. When two diagnostic tests are compared on the same sample of individuals, the sensitivities and specificities are paired and so the appropriate confidence interval (section 10.4.1) and the McNemar test (section 10.7.5) should be used.

14.4.6 Two further ways of looking at diagnostic tests

(This section can be omitted without loss of continuity.)

The apparent simplicity of diagnostic test data, particularly when presented as a 2 by 2 table, is belied by the many ways of expressing the results.

Here I consider two further approaches that are more informative than simply looking at sensitivity and specificity.

(a) The likelihood ratio

For any test result we can compare the probability of getting that result if the patient truly had the condition of interest with the corresponding probability if they were healthy. The ratio of these probabilities is called the likelihood ratio (LR), and for a positive test result it is calculated as

$$\text{LR} = \frac{\text{sensitivity}}{1 - \text{specificity}}$$

We can consider the likelihood ratio as indicating the value of the test for increasing certainty about a positive diagnosis. The prevalence is the probability of disease before the test is performed. The odds of having the disease are thus given as prevalence/(1 - prevalence). Thus if the prevalence is 0.10, the odds are 0.10/0.90 = 0.11, or 9 to 1 against the disease being present. We can call this figure the pre-test odds, and the odds corresponding to the positive predictive value the post-test odds. It is not difficult mathematically to show that

$$\text{post-test odds} = \text{pre-test odds} \times \text{likelihood ratio}$$

demonstrating how the likelihood ratio measures the change in certainty of diagnosis.

For the data in Table 14.6 the prevalence of abnormal pathology is 0.75, so the pre-test odds of disease are 0.75/0.25 = 3.0. The post-test odds of disease given a positive test are about 7.2, and the likelihood ratio is about 2.4, demonstrating the stated relation between these three quantities (7.2 = 3.0 × 2.4). For the data in Table 14.7 the likelihood ratio is the same, but the pre-test odds of disease are 0.25/0.75 = 0.33. We can obtain the post-test odds as 0.33 × 2.4, or about 0.8.
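The relation between pre-test odds, likelihood ratio and post-test odds is easy to check numerically. The sketch below uses the two prevalences mentioned in the text (0.75 and 0.25) but a hypothetical sensitivity and specificity, so the printed odds are illustrative and are not the values for Tables 14.6 and 14.7. All function names are assumptions.

```python
def odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)


def probability(odds_value):
    """Convert odds back to a probability."""
    return odds_value / (1 + odds_value)


def likelihood_ratio_positive(sensitivity, specificity):
    """Likelihood ratio for a positive test result."""
    return sensitivity / (1 - specificity)


# Hypothetical sensitivity and specificity, for illustration only.
lr = likelihood_ratio_positive(sensitivity=0.90, specificity=0.85)

for prevalence in (0.75, 0.25):
    pre_test = odds(prevalence)
    post_test = pre_test * lr  # post-test odds = pre-test odds x LR
    print(f"prevalence {prevalence:.2f}: pre-test odds {pre_test:.2f}, "
          f"post-test odds {post_test:.2f}, PPV {probability(post_test):.3f}")
```

Converting the post-test odds back to a probability recovers the PPV that Bayes' theorem gives directly, which is why the odds formulation adds insight but no new information.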

This approach may give further insight into the interpretation of diagnostic test data, but it does not add new information because the same quantities are used as before. As I have just shown, a high likelihood ratio may demonstrate that the test is useful but it does not necessarily indicate that a positive test is a good indicator of the presence of disease. For the data in Table 14.7, the low prevalence of 0.25 means that someone with a positive test is still more likely to be normal than abnormal; this is seen from both the post-test odds of 0.81 and the PPV of 0.45. Using odds rather than probabilities may be helpful, however, especially for seeing the usefulness of the test as assessed by the likelihood ratio (Ingelfinger et al., 1987, p. 25).

(b) ROC curve

When a measurement is used to make a diagnosis the choice of the 'best' cut-off is not simple. A graphical approach is to plot the sensitivity versus 1 - specificity for each possible cut-off, and to join the points. The curve thus obtained is known as a 'receiver operating characteristic' curve or ROC curve, because the method originated in studies of signal detection by radar operators. For the data in Table 14.10 the curve would thus be based on the second and third columns. However, the ROC curve is not very helpful for these data because the specificities are so high that the 'curve' closely follows the vertical axis. If the 'cost' of a false negative result is the same as that of a false positive result, the best cut-off is that which maximizes the sum of the sensitivity and specificity, which is the point nearest the top left-hand corner. With different costs it is hard to identify the best point from the graph.
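The construction of the points on an ROC curve can be sketched as follows. The measurements and disease labels are hypothetical, and the sketch assumes that values at or above the cut-off are called positive; the function name is likewise an assumption.

```python
def roc_points(values, diseased, cutoffs):
    """Sensitivity and 1 - specificity when 'value >= cutoff' is called positive."""
    n_diseased = sum(diseased)
    n_healthy = len(diseased) - n_diseased
    points = []
    for cutoff in cutoffs:
        true_pos = sum(1 for v, d in zip(values, diseased) if d and v >= cutoff)
        false_pos = sum(1 for v, d in zip(values, diseased) if not d and v >= cutoff)
        sensitivity = true_pos / n_diseased
        specificity = 1 - false_pos / n_healthy
        points.append((cutoff, sensitivity, 1 - specificity))
    return points


# Hypothetical measurements; True marks a diseased subject.
values = [1.2, 1.9, 2.4, 2.8, 3.1, 3.5, 3.9, 4.4, 5.0, 5.7]
diseased = [False, False, False, True, False, False, True, True, True, True]

points = roc_points(values, diseased, cutoffs=sorted(set(values)))
for cutoff, sens, one_minus_spec in points:
    print(f"cut-off {cutoff:.1f}: sensitivity {sens:.2f}, "
          f"1 - specificity {one_minus_spec:.2f}")

# With equal costs, choose the cut-off maximizing sensitivity + specificity.
best = max(points, key=lambda p: p[1] + (1 - p[2]))
print("best cut-off:", best[0])
```

Plotting the second element of each tuple against the third and joining the points gives the ROC curve. The final step encodes only the equal-cost rule described above; unequal costs would need a different criterion.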

The ROC method is perhaps most useful when comparing two or more competing methods. For a single test it does not add anything to a table but it is preferable when there are many possible cut-off values. Of course, the ROC curve, being based only on sensitivity and specificity, takes no account of the prevalence of the disease being tested for.

14.4.7 What is the patient's true condition?

In section 14.4.3 I observed that the sensitivity and specificity calculated from a sample of subjects are unrelated to the prevalence of abnormality. This may not always be the case. We can consider three ways of categorizing a patient: their true condition, the diagnosis, and the test results. When we calculate the sensitivity and specificity of the test we do this in relation to the diagnosis, but we do not necessarily know that the diagnosis is always correct. Unless the diagnosis is perfect, so that it always gives the patient's true status (positive or negative), we are evaluating the test's ability to predict the diagnosis rather than the patient's true disease status. In this case, the sensitivity and specificity of the test in relation to the true state are related to the prevalence of abnormality (Begg, 1987). This suggests that unless it is known that the diagnosis is almost always correct, it is wise to evaluate a diagnostic test on patients with the same prevalence of disease as those for whom the test will be used in future.

14.4.8 Discussion

The analysis of data from diagnostic tests requires no complicated mathematics. The main difficulty is not statistical, but rather the need to decide how good the test should be to be clinically valuable. The answer to this question is related to the prevalence of the disease in the subjects being tested. Two extremes are when we are testing high risk individuals, perhaps in a tertiary referral centre, and when we are screening an ostensibly healthy population for early signs of rare serious disease, such as cervical cancer. For screening tests it is very important to have high sensitivity and NPV. We do not want false negative results and are willing to accept a moderate number of false positive results. All those positive to the screening test will then be tested again, usually with a different test. Here the requirement will be a high specificity and PPV, because a positive result will probably lead to a diagnosis of disease and clinical intervention. A high sensitivity is also desirable, of course. The detection of HIV seropositivity is a good example of the case where a false positive diagnosis would have major consequences for the patient, as would a false negative diagnosis for someone receiving their blood in a transfusion. Another is the use of alpha-fetoprotein levels from amniocentesis to detect fetuses with Down's syndrome. These issues must be carefully weighed up when deciding where to put the cut-off between positive and negative diagnosis in the data in Table 14.9 or, indeed, whether it is wise to impose any cut-off.

One approach that could be adopted more frequently is to use the diagnostic test to divide subjects into three groups, with a central, 'uncertain' group who would be subjected to further testing. For the data shown in Table 14.9 Weiss et al. (1985) considered assay results between 3.0 and 5.0 as 'borderline'.

Finally, a link with the earlier sections of this chapter is that it is a requirement of a good diagnostic test that the result is repeatable and is subject to minimal inter-observer variation.

Further discussion of the methodology and interpretation of diagnostic tests can be found in the paper by Sheps and Schechter (1984), the series of articles from the Department of Clinical Epidemiology and Biostatistics at McMaster University (1983) and in the books by Galen and Gambino (1975) and Ingelfinger et al. (1987). The logic of clinical diagnosis and computer applications are reviewed by Macartney (1987).

14.5 REFERENCE INTERVALS

Diagnostic tests use patient data to classify individuals as either normal or abnormal. A related statistical problem is the description of variability in normal individuals, to provide a basis for assessing test results for other individuals. The most common form of presenting such data is as a range of values, or interval, which encompasses the values obtained from the majority of a sample of normal subjects. The reference interval is often referred to as a normal range or reference range. 'Reference interval' is a better term, both because it avoids confusion with Normal in the statistical sense, and also because the word 'range' suggests that values excluded are by definition abnormal.

Reference intervals are used most often in clinical chemistry, for example to provide a standard reference against which to assess cholesterol levels in blood samples from patients under investigation. As with diagnostic tests the calculations required are essentially simple and most of the problems are associated with interpretation. One point to note is that the procedure is equivalent to a diagnostic test where we know the specificity (usually 95%) but nothing else. Clearly such information should not be used on its own to make a diagnosis. Detailed discussion of the concepts of reference intervals is given in Solberg (1987) and the papers cited therein.

14.5.1 Selecting a sample

The concept of 'normality' is elusive, and any definition will be specific to the context. Reference intervals are often derived from samples taken in hospital from subjects subsequently found not to be seriously ill, but people in hospital are not normal in the sense of being representative of the healthy population. It is essential to describe how the reference subjects were selected and on what basis their health was determined.

Sample size is also an important consideration, and is discussed in section 14.5.3. Also there may be variation in the distribution of the measurement of interest between different groups of subjects. In particular it is frequently necessary to calculate separate intervals for males and females. There is often also variation by age, especially among children; this topic is considered in section 14.5.4.

14.5.2 Calculating the reference interval

The reference interval is simply the estimated range of values that includes a certain percentage of the values among the relevant population. As with other intervals discussed in earlier chapters, reference intervals usually encompass 90%, 95% or 99% of the values, with 95% the most frequently used. The same method is used whether both low and high values are considered suspicious or only those at one extreme.

There are two basic approaches to the calculation. We can either take the appropriate (per)centiles from the empirical distribution of the observations, or we can use the Normal distribution, perhaps after transforming the data. Many serum constituents, for example, have Lognormal distributions. The options are thus the same as for the general methods introduced in section 3.4 for summarizing the distribution of a set of observations. In that section the serum IgM values from 298 healthy children aged 0 to 6 years were analysed. In section 3.4.2 the 2.5th and 97.5th centiles were calculated as 0.2 and 2.0 g/l. The range of values from 0.2 to 2.0 g/l thus defines a 95% reference interval using the percentile method. The distribution of IgM was skewed (Figure 3.3) but log IgM had a symmetrical distribution (Figure 3.13), with mean -0.158 and standard deviation 0.238.

If we can consider the distribution of log IgM as close to Normal we can use the standard Normal distribution to estimate the required centiles (see section 4.5.2). The 95% reference interval for log IgM is calculated as mean ± 1.96 SD, and the values are antilogged to give the 95% reference interval for IgM. We thus calculate first

$$-0.158 - 1.96 \times 0.238 \quad \text{and} \quad -0.158 + 1.96 \times 0.238$$

that is, -0.624 and 0.308, and back-transform these values (using $10^x$ as in section 3.4) to get a 95% reference interval for IgM as 0.24 to 2.03 g/l. The two approaches give very similar answers for these data.
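The parametric calculation can be reproduced in a few lines. The sketch below uses the mean and standard deviation of log IgM quoted above (-0.158 and 0.238) and assumes that the logarithms are to base 10, which is consistent with those figures; the variable names are hypothetical.

```python
# Summary statistics for log serum IgM (assumed to be base 10 logarithms).
MEAN_LOG_IGM = -0.158
SD_LOG_IGM = 0.238
Z_95 = 1.96  # standard Normal deviate for a central 95% interval

# Limits on the log scale: mean -/+ 1.96 SD.
lower_log = MEAN_LOG_IGM - Z_95 * SD_LOG_IGM
upper_log = MEAN_LOG_IGM + Z_95 * SD_LOG_IGM

# Back-transform (antilog) to the original g/l scale.
lower = 10 ** lower_log
upper = 10 ** upper_log

print(f"log scale: {lower_log:.3f} to {upper_log:.3f}")
print(f"IgM scale: {lower:.2f} to {upper:.2f} g/l")
```

The back-transformed interval is not symmetric about the antilog of the mean, reflecting the skewness of the untransformed IgM values.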

As always there are advantages and disadvantages of the alternative approaches and each has strong advocates. The parametric approach depends on the data having a closely Normal distribution, perhaps after transformation. We can use a formal test of non-Normality, as described in section 7.5.3. The Normal plot for the log IgM data in Figure 14.5 shows that the data are indeed close to a Normal distribution. The alternative percentile approach makes no assumptions about the data, but is less reliable when the data are Normal.

Figure 14.5 Normal plot of log serum IgM data in children (Isaacs et al., 1983).

14.5.3 Uncertainty and sample size

Whereas the parametric approach is based on estimates of the mean and standard deviation, the percentile approach is based on observations in the tails of the distribution. For both methods the reference interval is obtained as two values which are subject to sampling variability. Several samples from the same population of healthy individuals will give different reference intervals, with the variability depending on sample size. Samples from different populations would be even more variable, and the use of different types of machine to measure the quantity of interest would increase variability further. Table 14.12 shows mean fetal scalp blood pH and reference intervals from 14 different samples of women in 12 centres. Five different types of pH meter were used. There is marked variation in the reference intervals, with two (numbers 3 and 14) hardly overlapping. Most noticeable, however, is the fact that most of the studies are very small, all but one being based on fewer than 50 subjects.

Table 14.12 Reference intervals from 14 studies of fetal scalp blood pH (Lumley et al., 1971)

| Study | Mean pH | 95% reference interval* | Sample size |
|---|---|---|---|
| 1 | 7.29 | 7.15 to 7.43 | 43 |
| 2 | 7.29 | 7.21 to 7.37 | 24 |
| 3 | 7.29 | 7.25 to 7.33 | 10 |
| 4 | 7.30 | 7.20 to 7.40 | 12 |
| 5 | 7.30 | 7.22 to 7.38 | 18 |
| 6 | 7.30 | 7.22 to 7.38 | 129 |
| 7 | 7.32 | 7.20 to 7.44 | 16 |
| 8 | 7.32 | 7.22 to 7.42 | 49 |
| 9 | 7.35 | 7.23 to 7.47 | 45 |
| 10 | 7.35 | 7.25 to 7.45 | 26 |
| 11 | 7.35 | 7.25 to 7.45 | 29 |
| 12 | 7.35 | 7.25 to 7.45 | 21 |
| 13 | 7.37 | 7.27 to 7.47 | 45 |
| 14 | 7.38 | 7.30 to 7.45 | 22 |

*mean ± 2SD

The standard error may be obtained for any estimated centile of the Normal distribution. For example, the values describing a 95% reference interval have a standard error of

$$\sqrt{\frac{s^2}{n} + \frac{1.96^2 s^2}{2n}}$$

where $s$ is the standard deviation of the observations and $n$ is the sample size. This is approximately equal to $\sqrt{3s^2/n}$. The widths of 95% confidence intervals for the limits of 95% reference intervals for different sample sizes are shown in Figure 14.6. For sample sizes smaller than about 50 the values defining the reference interval themselves have a confidence interval wider than the standard deviation of the observations. In order to reduce the uncertainty we need much larger samples, preferably of at least 200 observations. Reference intervals derived by the non-parametric percentile method have confidence intervals that are much wider than those shown in Figure 14.6 (Linnet, 1987). The parametric approach is therefore much better if we can make the data conform closely to a Normal distribution, unless we have a very large sample.

Figure 14.6 Width of parametric 95% confidence interval for limits of 95% reference interval as a multiple of the standard deviation if the data have a Normal distribution.
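The dependence of this uncertainty on sample size can be tabulated directly. The sketch below assumes the usual Normal-theory approximation for the standard error of a limit of the form mean ± 1.96 SD, namely the square root of s²/n + 1.96² s²/(2n), and expresses the 95% confidence interval width as a multiple of the standard deviation. The function names are hypothetical.

```python
import math


def reference_limit_se(sd, n, z_limit=1.96):
    """Approximate SE of a reference limit mean +/- z_limit * SD (Normal data)."""
    return sd * math.sqrt(1 / n + z_limit ** 2 / (2 * n))


def ci_width_in_sd_units(n, z_limit=1.96, z_ci=1.96):
    """Width of the confidence interval for a limit, as a multiple of the SD."""
    return 2 * z_ci * reference_limit_se(1.0, n, z_limit)


for n in (10, 20, 50, 100, 200, 500):
    print(f"n = {n:3d}: CI width = {ci_width_in_sd_units(n):.2f} SD")
```

Under these assumptions the width falls just below one standard deviation at around 50 observations and is still nearly half a standard deviation at 200, in line with the recommendation for large samples.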

14.5.4 Relation to age

Many clinical and biochemical variables vary with age in healthy individuals. For example, as people get older their blood pressure tends to rise and they tend to put on weight. During childhood we are especially likely to find changes with age, and the same applies to both mother and fetus during pregnancy. It is important to investigate possible relations with age, especially for measurements on children or during pregnancy. Failure to do so may lead to the finding of a spurious change in prevalence of abnormality with age.

Not only the mean but also the standard deviation may vary with age. Further, the assessment of Normality needs to be made for small age groups. Regression can be used to fit a curve to the means and, if necessary, a separate curve to the standard deviations. The residuals from these analyses should show no relation to age. Careful analysis of the IgM data from children aged 6 months to 6 years showed that both the mean and standard deviation of log IgM increased slightly and then decreased in the 5½ year period. Quadratic regression lines were fitted separately to the mean and SD of log IgM for 6 month age groups. These two curves were then combined to give mean ± 1.96 SD at each age, and everything was antilogged to give the age-related reference interval shown in Figure 14.7. Further details of the method are given in the original paper (Isaacs et al., 1983).

Figure 14.7 95% age-related reference interval for IgM (Isaacs et al., 1983).

Figure 14.8 Centiles for birthweight of first-born male babies (Altman and Coles, 1980), showing empirical (raw) centiles and curves derived from regression models.

Exactly the same statistical problem arises in constructing 'standards' of fetal or child growth. For example, as well as fitting quadratic curves to mean birthweight as shown in Figure 11.16, cubic curves were fitted to the standard deviations and several age-related centiles obtained, as shown in Figure 14.8 for first-born male babies.

14.5.5 Discussion

It is common in clinical practice to classify subjects as normal or abnormal with regard to some clinical or biochemical measurement as an aid to decision-making and thus treatment. When data are available for normal (healthy) and abnormal (ill) subjects we have the type of data that form the basis of a diagnostic test, as discussed in section 14.4. If we wish to use the measurement itself as a measure of abnormality, then we need to describe the variation among some defined group, usually of healthy subjects. The creation of a reference interval will, however, inevitably lead to the inference that subjects whose values fall outside the interval are abnormal. While this may be true, such an inference is not valid, both because the interval by definition excludes a fixed small percentage of healthy subjects, and also because the values of the variable in ill subjects are not known. Where the measurement itself defines the condition, such as blood pressure above a certain level being termed 'hypertension', the logic becomes even more diffuse (Pickering, 1978).

From a statistical point of view, the most interesting question is whether to use the parametric method or the percentile method. While the percentile approach is attractive both in its simplicity and validity for all data sets, there are two important advantages of using the parametric method based on Normal distribution theory. Firstly, the confidence intervals for the values defining the reference interval are much narrower than for the equivalent percentile reference interval. Secondly, the use of the Normal distribution allows any subject's measurement to be expressed as a standard deviation score, and hence located at a particular percentile, which is much more informative than knowing whether they are inside or outside the reference interval. In other words, we can see how unusual a value is. (There is a strong analogy here to P values.) Where it is possible, therefore, to treat the data or some transformation of the data as Normal the parametric approach should be used.
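The second advantage can be illustrated with a short sketch that converts a measurement to a standard deviation score and then to a centile of the Normal distribution. The reference mean and standard deviation are hypothetical, as are the function names; the Normal distribution function is obtained from the error function in Python's standard library.

```python
import math


def sd_score(value, mean, sd):
    """Number of standard deviations by which a value lies above the mean."""
    return (value - mean) / sd


def normal_centile(z):
    """Percentage of a standard Normal distribution lying below z."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical reference population: mean 5.0 and SD 0.8 (arbitrary units).
for value in (3.4, 5.0, 6.2, 7.0):
    z = sd_score(value, mean=5.0, sd=0.8)
    print(f"value {value}: SD score {z:+.2f}, centile {normal_centile(z):.1f}")
```

Two values that both fall outside a 95% reference interval can thus be distinguished by how far into the tail they lie, provided the Normal assumption is reasonable for the data or their transformation.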

The sample size should be large enough to restrict uncertainty about the limits of the reference interval, preferably with a bare minimum of 100 subjects for a parametric analysis and 200 for the percentile method. For age-related intervals it is important to smooth the data across ages. Apart from the fact that smoothly changing values are more plausible, there is much better statistical use of the data. In all cases, reports of new reference intervals should specify the criteria for inclusion of subjects and the statistical methods used.

14.6 SERIAL MEASUREMENTS

14.6.1 Introduction

Two types of study may yield a series of observations, or serial measurements, on each subject. Firstly, there are designed studies where repeated measurements are taken on each individual at specific times chosen in advance. Even when there are complete data for each individual, the appropriate analysis and interpretation of such data are not obvious. Secondly, data can arise from observational studies where multiple measurements are taken at unspecified times. With such data there may be doubts about the reason for the observations. For example, women with many measurements of blood pressure during pregnancy are likely to be a high risk group.

There are several approaches to analysing serial data, each with advantages and disadvantages. In particular some methods are complex both to perform and interpret, and some can be applied only to data at fixed time points. Here I shall consider a simple approach which gives useful results in most situations. It can be applied to experimental or observational data, and can thus be used for structured data sets with missing observations, which is a common phenomenon. A fuller discussion is given by Matthews et al. (1990). The method will be illustrated using the data in Table 14.13 and Figure 14.9, which show serum progesterone levels at several times up to two hours after nasal administration of progesterone for four groups of women.

14.6.2 The usual approach to analysis

The most common method of analysing data like these is to perform independent analyses at each time point, such as two-sample t tests or one-way analysis of variance. Frequently the data are displayed graphically by a plot joining the mean values at each time point, often with 'error bars' based on the standard error (or perhaps the standard deviation). There are several important criticisms of this approach:

  1. It ignores the design of the study, as no account is taken of the fact that the values at each time point are from the same individuals;

  2. The curve joining the means may not be a good indicator of the typical curve for an individual, and will hide any variation in the shape of the curves for different individuals;


Columns 0 to 120 give the time after administration (min).

| Subject | 0 | 1 | 3 | 5 | 10 | 15 | 30 | 45 | 60 | 120 | Peak value (nmol/l) | Time to peak (min) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Group 1 (0.2 ml of 100 mg/ml progesterone in one nostril) | | | | | | | | | | | | |
| 1 | 1.0 | - | 10.0 | 16.0 | 22.0 | 20.0 | 16.0 | - | 18.0 | 14.0 | | |
| 2 | 6.5 | - | 9.5 | 11.6 | 17.5 | 27.5 | 28.5 | 22.4 | 19.3 | 10.0 | | |
| 3 | 3.0 | 4.0 | 4.0 | 13.0 | 15.8 | 19.5 | 21.2 | 15.9 | 10.7 | 13.4 | | |
| 4 | 1.0 | 2.1 | 9.7 | - | 21.8 | - | 27.5 | - | 15.5 | 6.2 | | |
| 5 | 1.0 | 1.0 | 1.0 | 4.2 | 22.6 | 23.9 | 45.5 | 42.6 | 35.0 | 10.6 | | |
| 6 | 1.0 | 1.0 | 1.0 | 1.0 | 3.9 | 14.7 | 17.6 | 16.1 | 8.8 | 10.8 | | |
| Mean | 2.3 | 2.8 | 5.9 | 9.2 | 17.3 | 21.1 | 26.0 | 24.8 | 17.9 | 10.8 | | |
| (SE) | (0.9) | (0.9) | (1.8) | (2.8) | (2.9) | (3.5) | (4.4) | (6.1) | (3.8) | (4.0) | | |
| Group 2 (0.3 ml of 100 mg/ml progesterone in one nostril) | | | | | | | | | | | | |
| 7 | 1.0 | 1.5 | 5.0 | 11.0 | 16.0 | 23.0 | 15.0 | 9.0 | 6.0 | 5.0 | | |
| 8 | 1.0 | 1.0 | 6.5 | 20.0 | 22.5 | 27.8 | 19.0 | 9.0 | 8.2 | 8.0 | | |
| 9 | 1.0 | 1.0 | 7.3 | 7.5 | 18.0 | 20.0 | 18.9 | 12.8 | 6.3 | 4.8 | | |
| 10 | 3.0 | 2.5 | 2.0 | 2.7 | 3.4 | 3.6 | 14.0 | 7.3 | 7.7 | 4.0 | | |
| 11 | 8.3 | 7.5 | 9.6 | 11.0 | 11.5 | 15.7 | 15.2 | 15.8 | 14.0 | 11.5 | | |
| 12 | 6.2 | 5.9 | 6.8 | 7.7 | 9.0 | 9.3 | 12.1 | 12.2 | 11.0 | 9.0 | | |
| Mean | 3.2 | 3.2 | 6.2 | 10.0 | 13.4 | 16.6 | 15.7 | 11.0 | 8.1 | 7.1 | | |
| (SE) | (1.3) | (1.1) | (1.0) | (2.4) | (2.8) | (3.7) | (1.1) | (1.3) | (1.3) | (1.1) | | |


Time after administration (min): 0, 1, 3, 5, 10, 15, 30, 45, 60, 120

Group 3 (0.2 ml of 200 mg/ml progesterone in one nostril)

| Subject | Serum progesterone (nmol/l), successive readings |
|---|---|
| 13 | 8.4, 10.8, 8.1, 7.8, 8.5, 12.0, 19.8, 22.2, 40.5 |
| 14 | 3.5, 3.2, 3.4, 3.3, 8.5, 9.4, 14.5, 12.7, 10.2 |
| 15 | 3.5, 4.0, 4.8, 3.5, 3.7, 13.0, 12.5, 15.0, 10.5 |
| 16 | 3.7, 3.2, 4.3, 4.5, 5.5, 8.5, 10.3, 11.1, 6.0 |
| Mean | 4.8, 5.3, 5.2, 4.8, 6.7, 10.7, 14.3, 15.3, 16.7 |
| (SE) | (1.2), (1.8), (1.0), (1.0), (1.2), (1.1), (2.0), (2.5), (4.1) |

Group 4 (0.2 ml of 100 mg/ml progesterone in each nostril)

| Subject | Serum progesterone (nmol/l), successive readings |
|---|---|
| 17 | 5.0, 5.6, 6.1, 7.2, 13.8, 26.0, 26.1, 25.7, 20.5 |
| 18 | 4.5, 5.1, 13.2, 21.0, 26.8, 28.0, 22.0, 17.8, 15.7 |
| 19 | 8.4, 6.2, 8.0, 18.5, 33.8, 35.0, 26.2, 23.0, 19.0 |
| 20 | 4.2, 3.2, 4.2, 4.8, 10.3, 13.7, 17.1, 18.3, 15.0 |
| Mean | 5.5, 5.0, 7.9, 12.9, 21.2, 25.7, 22.8, 21.2, 18.2 |
| (SE) | (1.0), (0.7), (1.9), (4.0), (5.5), (4.4), (2.2), (1.9), (1.0) |


图14.9 四组女性鼻腔给药孕酮后的血清孕酮水平。数据来源于表14.13。
Figure 14.9 Serum progesterone levels after nasal administration of progesterone in four groups of women. Data from Table 14.13.

3.当比较不同受试者组时,获得的多个非独立值难以解释,甚至不可能解释;
3. It is difficult, if not impossible, to interpret the multiple non-independent values that are obtained when different groups of subjects are compared;

4.无法对任何缺失观察值进行调整,因此不同时间点的数据可能不完全对应同一组受试者。
4. No allowance can be made for any missing observations, so the data at different times may not relate to exactly the same group.

上述第一点是关键,其他问题均由此引出。例如,如果前两组仅在15分钟时存在显著差异,我们如何解释图14.9中的数据分析结果尚不明确。此外,我们是否应考虑两组基线(0分钟)值的差异?如果是,应如何处理?此类研究的目的通常是评估随时间的反应,因此最好将分析方法针对临床目标进行调整。
The first point above is the critical one, from which the others follow. It is not at all clear how we would interpret the analysis of the data in Figure 14.9 if, for example, the first two groups were significantly different only at 15 minutes. Further, should we take account of any differences in baseline (time zero) values in the two groups and, if so, how? The purpose of this type of study is usually to assess the response over time, so it is far better to tailor the analysis to the clinical objective.

14.6.3 使用汇总指标进行分析 14.6.3 Analysis using summary measures

对序列测量数据进行分析时,最实用的一般方法是简化分析,将每个受试者的数据归纳为若干特定感兴趣的特征。可以对每个个体的数据拟合统计模型,或直接从观察数据中计算所需量。这些汇总指标随后以与原始观察值相同的方式进行分析。显然,该方法依赖于选择具有临床相关性的汇总指标的能力。

Probably the most useful general approach to the analysis of serial measurements is to simplify the analysis by reducing each subject's data to certain features of particular interest. Either a statistical model may be fitted to each individual's data or the necessary quantities can be derived directly from the observed data. These summary measures are then analysed in the same way as if they were the original observations. Clearly this approach relies on the ability to choose summary measures of clinical relevance.

对于临床测量,唯一常用的模型是对每个受试者的数据随时间拟合线性回归。回归线的斜率表示测量值随时间单位(如每小时)的变化率。显然,线性回归仅适用于数据随时间呈系统性上升或下降趋势的情况。许多数据集,如图14.9所示,呈现先上升后下降(或反之)的趋势。任何简单的统计模型都难以很好地拟合此类数据。
For clinical measurements the only commonly used model is to fit a linear regression of each subject's data on time. The slope of the line represents the rate of change of the measurement per unit of time (e.g. per hour). Clearly, linear regression is appropriate only for data which tend either to rise or fall systematically over time. Many data sets, such as that in Figure 14.9, have a general tendency to rise and then fall (or vice versa). It is unlikely that any simple statistical model would fit such data at all well.
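As a rough illustration of the slope as a summary measure, the sketch below fits a least-squares line to each subject's readings and keeps only the slope. The function name `slope`, the subject labels and the readings are hypothetical, invented for the example; missing readings are assumed to be coded as `None`.

```python
def slope(times, values):
    """Least-squares slope of values on times, skipping missing (None) readings."""
    pairs = [(t, y) for t, y in zip(times, values) if y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two observations to fit a line")
    mean_t = sum(t for t, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pairs)
    sxx = sum((t - mean_t) ** 2 for t, _ in pairs)
    return sxy / sxx


# Hypothetical readings (not from the book): a measurement falling over 4 hours.
hours = [0, 1, 2, 3, 4]
subjects = {
    "A": [10.0, 9.0, 8.1, 7.0, 6.0],
    "B": [12.0, 11.5, None, 10.4, 10.0],  # one missing reading
}

# One summary value (rate of change per hour) per subject.
slopes = {name: slope(hours, ys) for name, ys in subjects.items()}
for name, b in slopes.items():
    print(f"subject {name}: {b:.2f} units per hour")
```

The resulting slopes, one per subject, would then be analysed like any other single observation per subject.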

一种更简单且常用的方法是直接从观察数据中提取汇总统计量,可能经过简单的数学计算。常见的派生统计量包括:
A simpler and more common approach is to take summary statistics directly from the observed data, perhaps after some simple mathematical calculation. Some of the more frequent derived statistics are:

所有测量值的平均值(即忽略时间响应)
峰值高度
达到峰值的时间
达到某一给定水平的时间
变化达到某一给定量的时间
高于某一给定水平的时间
达到相对于初始水平(基线)的最大变化的时间
返回(接近)基线水平的时间
从第一次测量到最后一次测量的变化
最终水平(可能是最后几次测量的平均值)
曲线下面积(AUC)
mean of all the measurements (i.e. ignore the time response)
height of peak
time to reach peak
time to reach a given level
time to change by a given amount
time above a given level
time to achieve maximum change from original level (baseline)
time to return (near) to baseline level
change from first to last measurement
final level (perhaps the average of the last few measurements)
area under the curve (AUC)

这些建议中有几项包含一些任意定义,这些定义应在分析之前确定,而非在观察数据后决定。部分建议专门针对有峰值的数据。当初始值变化较大时,可以使用相对于基线的变化。
Several of these suggestions incorporate some arbitrary definitions which should be chosen in advance of the analysis rather than after inspection of the data. Several are specifically aimed at data with peaks. Where initial values vary considerably the change from baseline may be used.

在某些情况下,AUC 可被解释为对干预措施的累计响应。其计算方法在第14.6.5节中有所描述。注意,对于等间隔观测,AUC(这些汇总统计中最难计算的)与所有测量值的平均值几乎相同。
The AUC may be interpreted in some circumstances as the cumulative response to the intervention. The calculation is described in section 14.6.5. Note that for equally spaced observations the AUC, which is the hardest of these summary statistics to calculate, is virtually the same as the mean of all the measurements.
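The closeness of the AUC and the simple mean for equally spaced observations can be seen with a little algebra. The notation is an assumption made for this sketch: $n+1$ readings $y_0, \ldots, y_n$ taken at a constant spacing $h$, so that the total time is $nh$, with the AUC obtained by joining the points with straight lines.

$$
\frac{\mathrm{AUC}}{nh}
= \frac{1}{nh}\cdot\frac{h}{2}\sum_{i=0}^{n-1}\left(y_i + y_{i+1}\right)
= \frac{1}{n}\left(\frac{y_0}{2} + y_1 + \cdots + y_{n-1} + \frac{y_n}{2}\right),
\qquad
\bar{y} = \frac{1}{n+1}\left(y_0 + y_1 + \cdots + y_n\right).
$$

The two expressions differ only in the half weight given to the first and last readings and in the divisor ($n$ rather than $n+1$), so with more than a few readings they will usually be close.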

Dalton 等人(1987)使用了三种指标来总结图14.9中的数据:峰值时间、从时间零点起的最大增量和AUC。一般来说,考虑两到三个衍生统计量是合理的,但和任何研究一样,最好确定一个主要关注的单一指标。合适指标的选择应与研究目标相关。例如,如果研究是治疗效果评估,我们可能最关注研究结束时的数值,或相对于起始值的变化。如果研究旨在评估镇痛药的有效性,那么我们可能更关注药物的快速效果,可能通过观察峰值时间和达到的水平,以及高于某一关键水平的时间来评估。

Dalton et al. (1987) used three measures to summarize the data in Figure 14.9: the time of the peak, the maximum increase from time zero and the AUC. In general it is reasonable to consider two or three derived statistics, but as in any study it is highly desirable to identify a single measure of primary interest. The choice of appropriate measures should relate to the study objectives. For example, if the study is one of treatment efficacy we may reasonably be most interested in the values at the end of the study, perhaps in relation to starting values. If the study is to evaluate the effectiveness of analgesics, then we would probably be interested in the rapid effectiveness of the drug, perhaps by looking at the timing of the peak and the level achieved, and perhaps also the time above some critical level.
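A minimal sketch of how the peak, the time of the peak and the maximum increase from time zero might be extracted for one subject is given below. It uses the readings for subjects 1 and 5 in Table 14.13. The function name `summarize`, the coding of missing readings as `None`, and the choice of the earliest time when the peak value is tied are assumptions made for the example, not definitions taken from Dalton et al.

```python
TIMES = [0, 1, 3, 5, 10, 15, 30, 45, 60, 120]  # minutes after administration

# Serum progesterone (nmol/l) for two Group 1 subjects in Table 14.13.
# None marks a missing reading.
SUBJECTS = {
    1: [1.0, None, 10.0, 16.0, 22.0, 20.0, 16.0, None, 18.0, 14.0],
    5: [1.0, 1.0, 1.0, 4.2, 22.6, 23.9, 45.5, 42.6, 35.0, 10.6],
}


def summarize(times, values):
    """Reduce one subject's series to three summary measures."""
    observed = [(t, y) for t, y in zip(times, values) if y is not None]
    baseline = observed[0][1]  # assumes the time-zero reading is present
    # max() returns the first maximum, i.e. the earliest time if values tie.
    time_of_peak, peak = max(observed, key=lambda pair: pair[1])
    return {
        "peak": peak,
        "time_to_peak": time_of_peak,
        "max_increase": peak - baseline,
    }


summaries = {sid: summarize(TIMES, ys) for sid, ys in SUBJECTS.items()}
for sid, s in summaries.items():
    print(sid, s)
```

Each subject is thereby reduced to a handful of numbers, any one of which can be compared between groups.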

虽然对汇总统计量的分析通常很简单,但这种方法也存在一些困难:
Although the analysis of summary statistics is usually simple, there are some difficulties with this approach too:

1. 由于研究目标过于模糊,可能难以明确最重要的特征;

1. it may be difficult to specify the feature(s) of major importance, because the study objective is too vague;

2. 选择使用的统计量可能会受到对数据的观察影响;

2. the choice of statistics to use may be influenced by inspecting the data;

3. 很难研究组间曲线形状的可能差异(但这本来就很难)。

3. it is difficult to study any possible variation between groups in the shape of the curves (but this is always difficult).

针对这些缺点,我们必须强调一些重要的额外优点;能够处理缺失观察值(见表14.13)和观察时间的变异;能够比较同一受试者在不同条件下的序列测量;以及结果易于理解和解释(这是多种替代方法的显著问题)。看似在分析汇总量时我们丢弃了大量数据,实际上大量观察值只是表面现象,因为同一患者的连续读数非常相似。患者是研究的单位,因此当我们每个患者只有一个数值时,处理这些数据更简单且更有意义。
Against these disadvantages we must set some important further advantages; the ability to cope with missing observations (see Table 14.13) and variable timing of observations; the ability to handle the comparison of serial measurements for the same subjects under different conditions; and the ease of understanding and explaining the results (a notable problem with several alternative approaches). It may seem that when we analyse summary measures we discard a lot of data. In fact the large number of observations is more apparent than real, as consecutive readings in any patient will be very similar. The patient is the unit of investigation, so it is easier and more meaningful to handle such data when we have only one value per patient.

14.6.4 图形展示 14.6.4 Graphical display

由于在每个时间点绘制均值可能产生误导,因此重要的是检查个体数据的图形,并尽可能将其包含在发表的论文中。图形可以迅速显示曲线是相似还是不同。不幸的是,图形展示仅对小样本有效。图14.9展示了原始血清孕酮数据的一种形式;图14.10展示了另一种替代形式。
Because of the potentially misleading effect of plotting mean values at each time point it is important to examine graphs of individuals' data, and if possible to include these in the published paper. A graph will show very quickly if the curves are similar or dissimilar. Unfortunately, graphical display is effective only for small samples. Figure 14.9 showed the raw serum progesterone data in one form; an alternative is shown in Figure 14.10.

也可以绘制汇总指标。一种对“峰值”数据有趣的展示方式是将峰值高度绘制为时间的函数。图14.11显示了孕酮数据的此类图形。这种图可能揭示其他图形中未显现的模式。更一般地,我们可以绘制任意两个汇总指标的散点图。示例中的数据是在所有受试者相同时间点收集的,但图形展示对不同时间点收集的数据可能更有用。
The summary measures can also be plotted. One interesting format for 'peaked' data is to plot the height of the peak against its time. Figure 14.11 shows such a plot for the progesterone data. This type of plot may reveal patterns that are not evident in other graphs. More generally, we can produce a scatter diagram of any two summary measures. The data in the example were collected at the same times for all subjects, but graphical display may be even more useful for data collected at varying times.

14.6.5 曲线下面积 14.6.5 The area under the curve

曲线下面积(AUC)是总结单个个体一系列测量信息的有用方法。它在临床药理学中经常使用,血清水平的AUC可解释为所给药物的总吸收量或生物利用度。

The area under the curve (AUC) is a useful way of summarizing the information from a series of measurements on one individual. It is frequently used in clinical pharmacology, where the AUC from serum levels can be interpreted as the total uptake or bioavailability of whatever had been administered.

图14.10 图14.9中血清孕酮数据的另一种展示方式。
Figure 14.10 Alternative display of serum progesterone data in Figure 14.9.

图14.11 孕酮峰值随时间的变化图。
Figure 14.11 Plot of peak values of progesterone by time.

数据点通过直线连接形成“曲线”。AUC通常通过累加相邻两次观测之间曲线下的面积计算。如果我们在时间点 $t_1$ 和 $t_2$ 分别有测量值 $y_1$ 和 $y_2$,那么这两个时间点之间的AUC是时间差与两次测量值平均值的乘积,即 $(t_2 - t_1)(y_1 + y_2)/2$。这被称为梯形法则,因为曲线下面积的每一段形状类似梯形。

The data are joined by straight lines to get a 'curve'. The AUC is usually calculated by adding the areas under the curve between each pair of consecutive observations. If we have measurements $y_1$ and $y_2$ at times $t_1$ and $t_2$, then the AUC between those two times is the product of the time difference and the average of the two measurements. Thus we get $(t_2 - t_1)(y_1 + y_2)/2$. This is known as the trapezium rule because of the shape of each segment of the area under the curve.

如果我们有 $n+1$ 个在时间点 $t_i$($i = 0, \ldots, n$)的测量值 $y_i$,则AUC计算公式为

If we have $n+1$ measurements $y_i$ at times $t_i$ ($i = 0, \ldots, n$) then the AUC is calculated as

$$\mathrm{AUC} = \frac{1}{2}\sum_{i=0}^{n-1}(t_{i+1} - t_i)(y_i + y_{i+1}).$$

AUC的单位是 $y$ 和 $t$ 所用单位的乘积,例如 nmol·min/l,较难直观理解。将AUC除以总时间,可以得到一个时间段内的加权平均水平,这通常更有用。

The units of the AUC are the product of the units used for $y$ and $t$, for example nmol.min/l, and are not easy to understand. It may be useful to divide the AUC by the total time to get a sort of weighted average level over the time period.

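表14.13中第一个受试者的计算过程如下。该受试者有八次观测,因此需计算七个面积。计算为

The calculation for the first subject in Table 14.13 goes as follows. There were eight observations for this subject, so seven areas to calculate. We have

$$
\begin{aligned}
\mathrm{AUC} = \tfrac{1}{2}\,[\,&(3-0)(1.0+10.0) + (5-3)(10.0+16.0) + (10-5)(16.0+22.0) + (15-10)(22.0+20.0) \\
&+ (30-15)(20.0+16.0) + (60-30)(16.0+18.0) + (120-60)(18.0+14.0)\,] = 1982.5\ \text{nmol.min/l}.
\end{aligned}
$$

该值也可表示为平均水平,即 $1982.5/120 = 16.52$ nmol/l。

This value can also be expressed as an average level of $1982.5/120 = 16.52$ nmol/l.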

即使存在缺失数据,只要最后一次观测值不缺失,也能计算AUC。
We can calculate the AUC even when there are missing data, except when the final observation is missing.
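Since the AUC is tedious to compute by hand, a short sketch of the trapezium rule may help. It is applied here to the first subject in Table 14.13. The function name `auc_trapezium`, the coding of missing readings as `None`, and the decision to refuse a series whose first or last reading is missing are assumptions made for the example.

```python
def auc_trapezium(times, values):
    """Area under the straight-line 'curve' joining the observed points.

    Missing readings (None) are skipped, so the line simply runs between
    the neighbouring observed points. The first and last readings must be
    present, otherwise the area would not cover the whole time period.
    """
    if values[0] is None or values[-1] is None:
        raise ValueError("first and last observations must be present")
    observed = [(t, y) for t, y in zip(times, values) if y is not None]
    area = 0.0
    for (t0, y0), (t1, y1) in zip(observed, observed[1:]):
        area += (t1 - t0) * (y0 + y1) / 2  # one trapezium per pair of readings
    return area


times = [0, 1, 3, 5, 10, 15, 30, 45, 60, 120]  # minutes
subject_1 = [1.0, None, 10.0, 16.0, 22.0, 20.0, 16.0, None, 18.0, 14.0]  # nmol/l

auc = auc_trapezium(times, subject_1)          # 1982.5 nmol.min/l
average_level = auc / (times[-1] - times[0])   # about 16.52 nmol/l
print(auc, round(average_level, 2))
```

Dividing by the total time, as in the last step, gives the weighted average level, which is usually easier to interpret than the raw area.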

14.6.6 解释 14.6.6 Interpretation

进行与临床兴趣问题无关的分析往往导致错误推断。当数据在多个时间点分别分析时,常见的推断是基于组间显著差异首次出现的时间点。显然,这个答案强烈依赖于样本量,且科学可信度较低。在表格或图形中展示所有原始数据很有价值,但在大型研究中可能难以实现。
Performing an analysis that does not relate to the questions of clinical interest often leads to incorrect inferences. When data are analysed separately at each of several time points it is common to see inferences based upon the time when groups become significantly different. Clearly the answer to this question will depend strongly on sample size, and has little if any scientific credibility. Presentation of all the raw data either in a table or figure is valuable, but neither may be feasible in a large study.

以汇总统计量作为统计分析基础,能避免许多困难,因为分析直接针对一个或多个具体感兴趣的问题。每个受试者仅有一个“观测值”,使解释更为简便。可以使用简单的估计和假设检验方法。
The use of summary statistics as the basis of statistical analysis avoids many difficulties by relating the analysis directly to one or more questions of specific interest. Interpretation is usually simplified by having one 'observation' per subject. Simple methods of estimation and hypothesis testing can be used.
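To illustrate how simple the analysis becomes, the sketch below compares peak progesterone levels in Groups 1 and 2 of Table 14.13 with an ordinary two sample t statistic. The peaks are taken here as each subject's largest recorded value, which is an assumption of the example, and the function name `two_sample_t` is hypothetical. The choice of a t test is illustrative only; with six subjects per group a rank-based method might be preferred.

```python
from math import sqrt
from statistics import mean, variance

# Largest recorded serum progesterone (nmol/l) for each subject in Table 14.13.
peaks_group_1 = [22.0, 28.5, 21.2, 27.5, 45.5, 17.6]  # subjects 1-6
peaks_group_2 = [23.0, 27.8, 20.0, 14.0, 15.8, 12.2]  # subjects 7-12


def two_sample_t(a, b):
    """Difference in means, its standard error, t statistic and degrees of freedom.

    Uses the pooled-variance (equal variance) form of the two sample t test.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    diff = mean(a) - mean(b)
    return diff, se, diff / se, df


diff, se, t, df = two_sample_t(peaks_group_1, peaks_group_2)
print(f"difference in mean peak = {diff:.2f} nmol/l, SE = {se:.2f}, t = {t:.2f} on {df} df")
```

The t statistic would then be referred to the t distribution with the stated degrees of freedom in the usual way, exactly as if the peaks were the original observations.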

14.7 周期性变化 14.7 CYCLIC VARIATION

许多测量值会随一天中时间的变化而变化。例如,大多数人的血压在夜间最低,早晨最高。
Many measurements vary according to time of day. For example, most people's blood pressure is lowest at night and highest during the morning.

昼夜节律变化也出现在许多激素水平中,甚至我们的身高在晚上通常比早晨略低。
Circadian variation is also seen in many hormone levels and even our height tends to be slightly lower in the evening than in the morning.

同样,个体测量值以及群体数据也可能随月份变化。表14.14显示了比利时5000多名新生儿按出生月份分类的脐带血IgE正常与异常的数量。高IgE水平用于检测易过敏的个体,该研究旨在验证此前一项发现出生月份与IgE水平相关的研究结果。
Similarly, individual measurements and also population data may vary by month of the year. Table 14.14 shows the number of births with normal and abnormal cord blood IgE levels by month of birth in a study of over 5000 Belgian newborns. A high level of IgE is used to detect those predisposed to become allergic, and the study was carried out to confirm the results of a previous study that had found an association with month of birth.

表14.14 按出生月份分类的脐带血IgE(Kimpen等,1987)
Table 14.14 Cord blood IgE by month of birth (Kimpen et al., 1987)

| Month | Total babies | Normal IgE (≤ 1.0 IU/ml) | Abnormal IgE (> 1.0 IU/ml) | % Abnormal |
|---|---|---|---|---|
| January | 331 | 319 | 12 | 3.6 |
| February | 416 | 401 | 15 | 3.6 |
| March | 528 | 503 | 25 | 4.7 |
| April | 503 | 481 | 22 | 4.4 |
| May | 496 | 468 | 28 | 5.6 |
| June | 462 | 447 | 15 | 3.2 |
| July | 518 | 504 | 14 | 2.7 |
| August | 411 | 396 | 15 | 3.6 |
| September | 456 | 449 | 7 | 1.5 |
| October | 446 | 437 | 9 | 2.0 |
| November | 374 | 368 | 6 | 1.6 |
| December | 412 | 398 | 14 | 3.4 |

当数据来自有序分组时,我们应直接检验线性趋势的可能性。对于如IgE值这类按月份排序的数据,组是有序的,但同时具有周期性。显然,寻找线性趋势没有意义;我们应探索系统的周期性趋势。这类数据可能来自对同一组个体的重复测量,或不同时间点来自独立受试者组。当不同时间点数据来自同一组个体时,该分析即为序列测量分析的一种特殊形式。例子包括月经周期中的激素水平测量或24小时内的血压监测。
When data come from ordered groups we should examine directly the possibility of a linear trend. With data like the IgE values, which relate to months, the groups are ordered but are also cyclic. Clearly it makes no sense to look for a linear trend; rather, we should explore the possibility of a systematic cyclic trend. Data like these may arise from repeated measurement of the same individuals, or where the data at different times are from independent groups of subjects. When data at different times come from the same individuals this analysis is thus a special form of the analysis of serial measurements. Examples are the measurement of hormone levels throughout the menstrual cycle or blood pressure over 24 hours.

已有多种方法用于分析此类数据。频数数据可用Freedman(1979)提出的非参数方法分析,例如检验疾病新发病例是否存在季节性变化。连续变量或比例则可通过对数据拟合正弦曲线进行分析。这种分析可以看作是一种复杂形式的回归。图14.12展示了异常IgE值的观察比例及拟合曲线。这里未详细描述的分析显示出高度显著的季节性模式。

Several methods exist for analysing such data. Frequencies can be analysed using a non-parametric method given by Freedman (1979), for example to see if the incidence of new cases of disease varies seasonally. Continuous variables or proportions can be examined by fitting a sinusoidal (or sine) curve to the data. This analysis can be regarded as a complex form of regression. Figure 14.12 shows the observed proportions of abnormal IgE values together with the fitted curve. The analysis, which is not described here, shows a highly significant seasonal pattern.

图14.12 IgE值超过 1.0 IU/ml 的观察百分比及拟合的正弦曲线。
Figure 14.12 Observed percentages of IgE values above 1.0 IU/ml and fitted sine curve.
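The analysis itself is not given in the text, so the following is only a sketch of one way a sine curve could be fitted to the monthly percentages in Table 14.14. It is an unweighted least squares fit of a curve with a 12 month period, written as a mean plus a cosine and a sine term. It ignores the different numbers of births in each month and gives no test of significance. All variable names are hypothetical.

```python
from math import atan2, cos, hypot, pi, sin

# Table 14.14: total births and births with abnormal cord blood IgE, January to December.
totals = [331, 416, 528, 503, 496, 462, 518, 411, 456, 446, 374, 412]
abnormal = [12, 15, 25, 22, 28, 15, 14, 15, 7, 9, 6, 14]
percent_abnormal = [100 * a / n for a, n in zip(abnormal, totals)]

months = len(percent_abnormal)
angles = [2 * pi * m / months for m in range(months)]  # January = 0

# With equally spaced points over one full cycle the cosine and sine terms are
# orthogonal, so the least squares estimates reduce to these simple sums.
mean_level = sum(percent_abnormal) / months
cos_coef = 2 / months * sum(y * cos(a) for y, a in zip(percent_abnormal, angles))
sin_coef = 2 / months * sum(y * sin(a) for y, a in zip(percent_abnormal, angles))

amplitude = hypot(cos_coef, sin_coef)
# Position of the fitted peak on a scale where 0 = January, 1 = February, ...
peak_position = (atan2(sin_coef, cos_coef) % (2 * pi)) * months / (2 * pi)


def fitted(month_index):
    """Fitted percentage abnormal for a month (0 = January)."""
    angle = 2 * pi * month_index / months
    return mean_level + cos_coef * cos(angle) + sin_coef * sin(angle)


print(f"mean {mean_level:.2f}%, amplitude {amplitude:.2f}%, peak position {peak_position:.1f}")
```

A fuller analysis would weight each month by its number of births and assess whether the amplitude differs from zero by more than chance.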

循环变化可能需要复杂的统计分析。这里引入该主题的目的是再次强调,在选择最合适的分析方法时,必须明确考虑数据的性质。我建议对这类数据寻求专业统计咨询。
Cyclic variation may require complicated statistical analysis. The purpose of introducing the topic here is to show again how the nature of the data needs to be considered explicitly when selecting the most appropriate analysis. I recommend expert statistical advice for data of this type.

练习 EXERCISES

14.1 下表显示了19名患者同时使用放射性($^{51}$Cr)和非放射性(生物素)细胞标记测量的红细胞容量(Cavill等,1988):
14.1 The following table shows red cell volume measured simultaneously in 19 patients using radioactive ($^{51}$Cr) and non-radioactive (biotin) cell labels (Cavill et al., 1988):

| Patient | $^{51}$Cr volume (ml) | Biotin volume (ml) |
|---|---|---|
| 1 | 1267 | 1954 |
| 2 | 1710 | 1651 |
| 3 | 1882 | 1887 |
| 4 | 1914 | 2043 |
| 5 | 1940 | 2054 |
| 6 | 1976 | 2075 |
| 7 | 2033 | 1976 |
| 8 | 2039 | 2120 |
| 9 | 2077 | 2061 |
| 10 | 2087 | 2152 |
| 11 | 2102 | 1894 |
| 12 | 2139 | 1982 |
| 13 | 2184 | 2153 |
| 14 | 2192 | 2288 |
| 15 | 2393 | 2628 |
| 16 | 2425 | 2495 |
| 17 | 2554 | 2463 |
| 18 | 2600 | 3186 |
| 19 | 3420 | 3488 |

作者使用Wilcoxon配对秩和检验比较了这两组数据,得到的结果是 。他们得出结论,两种方法的比较“未显示出一致的临床显著差异”。
The authors compared the two sets of data by the Wilcoxon matched pairs rank sum test, for which they got . They concluded that the comparison of methods 'showed no consistent clinically significant difference between the two'.

(a) 对他们的分析和解释进行评论。
(a) Comment on their analysis and interpretation.

(b) 进行更好的分析。
(b) Carry out a better analysis.

(c) 患者均被转诊进行红细胞体积测量这一事实有何意义?
(c) What is the relevance of the fact that the patients had all been referred for the measurement of red cell volume?

(d) 两种方法之间最大的差异出现在受试者1号和18号。生物素法受之前食用鸡蛋的影响,作者指出“至少有一名患者早餐吃了鸡蛋”。分析时应考虑这一信息吗?
(d) The largest differences between the methods are those for subjects 1 and 18. The biotin method is affected by prior consumption of eggs, and the authors note that 'at least one of these patients had had an egg for breakfast'. Should the analysis take account of this information?

14.2 Furst 和 Paulus(1975)报道了一项研究,比较了12名类风湿关节炎患者和12名正常对照者对克罗尼辛的代谢情况。该药物作为治疗类风湿关节炎的抗炎镇痛药正在研究中。单次服用三片250 mg克罗尼辛后,于0、0.5、1、2、4、6和8小时测定血清中克罗尼辛水平。作者未报告初始(0小时)数值;其余数据如下:
14.2 Furst and Paulus (1975) reported a study to compare the metabolism of clonixin in 12 patients with rheumatoid arthritis and 12 normal controls. The drug was under investigation as an anti-inflammatory analgesic for treatment of rheumatoid arthritis. Serum clonixin levels were measured at 0, 0.5, 1, 2, 4, 6 and 8 hours after administration of a single dose of three 250 mg tablets of clonixin. The authors did not report the initial (0 hour) values; the remaining data are shown below:

类风湿关节炎患者:
Patients with rheumatoid arthritis:

Clonixin levels (μg/ml) by time (hours)

| Patient | 0.5 | 1 | 2 | 4 | 6 | 8 |
|---|---|---|---|---|---|---|
| 1 | 12.70 | 32.20 | 42.00 | 19.80 | 7.09 | 2.10 |
| 2 | 18.48 | 40.24 | 45.87 | 15.61 | 5.58 | 3.25 |
| 3 | 6.70 | 20.60 | 27.70 | 11.49 | 2.48 | 0.56 |
| 4 | 24.20 | 16.20 | 7.84 | 5.30 | 0.38 | 0.00 |
| 5 | 14.70 | 28.30 | 31.90 | 16.08 | 9.20 | 3.60 |
| 6 | 6.55 | 29.17 | 33.30 | 15.17 | 3.17 | 0.00 |
| 7 | 41.70 | 29.40 | 16.90 | 7.04 | 3.48 | 2.56 |
| 8 | 1.49 | 47.26 | 32.78 | 15.89 | 4.72 | 2.61 |
| 9 | 13.04 | 19.08 | 39.47 | 12.42 | 4.91 | 2.86 |
| 10 | 29.28 | 44.94 | 45.72 | 12.71 | 4.43 | 1.67 |
| 11 | 8.61 | 20.34 | 44.33 | 6.74 | 2.15 | 1.11 |
| 12 | 28.10 | 56.10 | 36.68 | 19.10 | 5.62 | 1.82 |

对照组受试者:
Control subjects:

Clonixin levels (μg/ml) by time (hours)

| Patient | 0.5 | 1 | 2 | 4 | 6 | 8 |
|---|---|---|---|---|---|---|

(a) 绘制各组的平均水平曲线。
(a) Plot the mean levels in each group.

(b) 使用合适的分析方法比较两组的峰值水平和曲线下面积,假设时间零点时克罗尼辛浓度为0.0。(曲线下面积在计算机程序中易于计算,但手工计算相当繁琐。)
(b) Compare the peak levels and the area under the curve in the two groups using a suitable analysis assuming that the clonixin level is 0.0 at time zero. (The AUC is easy to calculate in a computer program, but is rather tedious to do by hand.)

(c) (a)中的图是否能很好地代表数据?
(c) Are the plots from (a) a good representation of the data?

14.3 对关于测谎仪(测谎器)的研究文献进行检索后,评估该机器的敏感性和特异性分别为0.76和0.63(Brett等,1986)。建议将测谎仪与询问潜在献血者是否吸毒相结合使用。(假设所有非吸毒者都说实话。)
14.3 A search of the literature for studies concerning the polygraph (lie-detector) led to the assessment of the sensitivity and specificity of the machine as 0.76 and 0.63 respectively (Brett et al., 1986). It is proposed that the polygraph be used in association with questioning potential blood donors about whether they are drug users. (Assume that all non-drug users tell the truth.)

(a) 如果5%的潜在献血者吸毒,且其中三分之一对这一事实撒谎,那么来自吸毒者的献血比例是多少?
(a) If 5% of potential donors use drugs and a third of them lie about it, what proportion of blood donations will be from drug users?
(b) 在测谎测试中,失败者中有多少比例是药物使用者?
(b) What proportion of people failing the polygraph test will be drug users?

14.4 急性下呼吸道感染是发展中国家婴儿及5岁以下儿童死亡的最常见原因之一。需要一种简单的测试来识别那些患有急性呼吸道感染且属于下呼吸道感染(LRI)的婴儿,这些婴儿应接受抗生素治疗,与上呼吸道感染(URI)患者区分开来。以下数据来自一项关于呼吸频率在此目的中的实用性的研究(Cherian 等,1988):
14.4 Acute lower respiratory tract infection is one of the commonest causes of death among infants and under-5s in developing countries. A simple test is needed to distinguish those infants with acute respiratory infection who have lower respiratory tract infection (LRI), and should receive antibiotics, from those with upper respiratory tract infection (URI). The following data come from a study of the usefulness of the respiratory rate for this purpose in infants (Cherian et al., 1988):

| Respiratory rate (breaths/min) | LRI: number of children (%) | URI: number of children (%) |
|---|---|---|
| 0–30 | 1 (1%) | 16 (11%) |
| 31–40 | 4 (3%) | 77 (51%) |
| 41–50 | 10 (7%) | 46 (30%) |
| 51–60 | 41 (29%) | 9 (6%) |
| 61+ | 86 (61%) | 3 (2%) |
| Total | 142 (100%) | 151 (100%) |

(a) 针对30、40、50和60次/分钟这四个临界值,构建表,将低呼吸频率和高呼吸频率与正确分类(LRI或URI)对应起来。哪一个临界值在敏感性和特异性之间达到最佳平衡?(即两者之和最大时)
(a) Construct tables for each of the four cut-offs 30, 40, 50 and 60 breaths/min relating low and high respiratory rate to the correct classification (LRI or URI). Which cut-off gives the best balance of sensitivity and specificity? (This is where their sum is a maximum.)

(b) 报告作者估计,在发展中国家所有急性呼吸道感染婴儿中,LRI的患病率为。在患病率为的情况下,哪一个临界值在阳性预测值和阴性预测值之间达到最佳平衡?
(b) The authors of the report estimated that the prevalence of LRI among all infants with acute respiratory infection in a developing country is . Which cut-off gives the best balance of positive and negative predictive values when the prevalence is

(c) 如果将呼吸频率超过 次/分钟作为下呼吸道感染(LRI)的指标,并对所有此类儿童使用抗生素治疗,那么接受治疗的婴儿中有多少比例是被不必要地治疗了?有多少比例的LRI婴儿不会得到抗生素治疗?
(c) If a respiratory rate of breaths/min is taken as an indication of LRI and all such children are treated with antibiotics, what proportion of treated infants will have been treated unnecessarily? What proportion of LRI infants would not get antibiotics?

(d) 目前,全科医生无法判断哪些婴儿患有LRI,约有 的呼吸道感染婴儿(包括LRI或URI)接受抗生素治疗。按照(c)中建议的政策,抗生素使用量会有什么影响?
(d) At present general practitioners cannot tell which infants have LRI and about of infants with respiratory tract infection (LRI or URI) receive antibiotics. What would be the effect on the amount of antibiotics used of the policy suggested in (c)?

14.5 一项观察者变异性的研究使用对3869个臼齿和前臼齿的放射线诊断龋齿(Espeland和Handelman,1989)。下表显示了三位牙医的诊断结果。牙齿被诊断为健康(S)或龋齿(C)。
14.5 A study of observer variation was performed using radiographic diagnosis of caries on 3869 molars and premolars (Espeland and Handelman, 1989). The following table shows the results for three dentists. Teeth were diagnosed as sound (S) or carious (C).

| Dentist 1 | Dentist 2 | Dentist 3 | Frequency |
|---|---|---|---|
| S | S | S | 2128 |
| S | S | C | 1122 |
| S | C | S | 54 |
| S | C | C | 226 |
| C | S | S | 36 |
| C | S | C | 87 |
| C | C | S | 7 |
| C | C | C | 209 |

(a) 哪对牙医之间的诊断一致性最好?
(a) Which pair of dentists agreed best?

(b) 这种一致性水平是否良好?
(b) Is this a good level of agreement?

15 临床试验 15 Clinical trials

在对照试验中,和所有实验工作一样,追求精确不应以牺牲常识为代价。
In a controlled trial, as in all experimental work, there is no need for the search for precision to throw sense out of the window.

Hill (1963)
Hill (1963)

15.1 引言 15.1 INTRODUCTION

临床试验是一种针对人体的有计划实验,旨在评估一种或多种治疗方法的有效性。试验可以用于评估任何被视为潜在治疗手段的内容,范围广泛,包括药物、外科手术、物理治疗、饮食、针灸、健康教育等。本文将使用“临床试验”一词指代任何此类研究。
A clinical trial is a planned experiment on human beings which is designed to evaluate the effectiveness of one or more forms of treatment. Trials can be carried out to evaluate anything that may be considered a potential treatment in its widest sense, such as drugs, surgical procedures, physiotherapy, diet, acupuncture, health education, and so on. I shall use the term clinical trial to refer to any such study.

临床试验因其医学重要性、设计和分析中的特殊问题以及某些伦理问题而值得特别关注。该方法论约在50年前引入医学研究,最著名的早期例子是比较链霉素加卧床休息与单纯卧床休息治疗肺结核的试验(MRC,1948)。在1940年代之前,比较性临床试验几乎不为人知。Pocock(1983,第14页)对临床试验的发展历史做了总结。
Clinical trials merit special attention because of their medical importance, some particular problems in design and analysis, and certain ethical problems. The methodology that is used was introduced into medical research about 50 years ago, with the most famous early example being a trial comparing streptomycin and bed rest with bed rest alone in the treatment of pulmonary tuberculosis (MRC, 1948). Comparative clinical trials were virtually unknown before the 1940s. Pocock (1983, p. 14) gives a summary of the historical development of clinical trials.

在制药行业中,临床试验被划分为以下四类:
Within the pharmaceutical industry clinical trials are classified into one of four categories:

  1. 第一阶段:临床药理学和毒性;

  1. Phase I: Clinical pharmacology and toxicity;

  2. 第二阶段:初步临床调查;

  2. Phase II: Initial clinical investigation;

  3. 第三阶段:治疗的全面评估;

  3. Phase III: Full scale evaluation of treatment;

  4. 第四阶段:上市后监测。

  4. Phase IV: Postmarketing surveillance.

本章我将仅讨论第三阶段试验。其显著特点是涉及两个或多个治疗方案的直接比较。它们通常被称为比较试验或对照试验。尽管有些对照试验设计用于比较多于两种治疗方案,我将重点关注常见的两组情况。通常,我会将这两种治疗视为一种实验性治疗,可能是一种新药,另一种为对照治疗,可能是标准治疗、安慰剂,甚至完全不治疗,具体取决于具体情况。

In this chapter I shall consider only Phase III trials. They have the distinguishing feature that they involve direct comparison between two or more treatments. They are often referred to as comparative trials or controlled trials. Although some controlled trials are set up to compare more than two treatments I shall concentrate on the common two group case. I shall usually consider the two treatments to be an experimental treatment, perhaps a new drug, and a control treatment, which may be a standard treatment, a placebo, or even no treatment at all, depending on circumstances.

实际上,绝大多数比较临床试验具有某些共同特征,这使得我们能够对设计、分析和解释提供一般性指导。也许正因如此,临床试验可能是医学研究中统计学思想和方法论融合最为成功的领域。
In practice the vast majority of comparative clinical trials have certain features in common which makes it possible to give general guidance on design, analysis and interpretation. Perhaps for this reason, clinical trials are probably the area of medical research where the integration of statistical ideas and methodology has been most successful.

临床试验的核心理念是我们希望比较的患者组仅在所接受的治疗上有所不同。若组间在其他方面存在差异,则治疗比较存在偏倚。如果能够识别出偏倚,可能在分析中加以调整,但未知的偏倚无法处理。本章介绍的设计和分析方法旨在消除偏倚。
The key idea of a clinical trial is that we wish to compare groups of patients who differ only with respect to their treatment. If the groups differ in some other way then the comparison of treatments is biased. If we can identify a bias then it may be possible to allow for its effect in the analysis, but unknown biases cannot be dealt with. The methods of design and analysis described in this chapter are aimed at the elimination of bias.

对本章所涉及问题的更深入探讨,以及未涵盖的相关主题,可参阅多部专门论述临床试验的著作,其中尤以Pocock(1983年)一书推荐。此外,Peto等人(1976年)和Pocock(1985年)的论文讨论了一些较为棘手的问题。最后,Bradford Hill(1984年)著名著作中关于临床试验的章节也蕴含丰富智慧。
Deeper consideration of the issues covered in this chapter, as well as topics not covered here, can be found in several books devoted to clinical trials, of which that by Pocock (1983) is particularly recommended. In addition, the papers by Peto et al. (1976) and Pocock (1985) discuss some of the trickier issues. Lastly, much wisdom can be found in the chapter on clinical trials in the famous book by Bradford Hill (1984).

15.2 临床试验的设计 15.2 DESIGN OF CLINICAL TRIALS

15.2.1 设立对照组的必要性 15.2.1 The need for a comparison group

新治疗的引入是一个漫长而复杂的过程,许多看似有前景的疗法最终都未能成功。开始时,通常会先在一些患者身上尝试新治疗以观察效果。这类研究是无对照的,因此患者身上观察到的任何益处或不良反应自然都会被归因于该治疗。这类研究通常是开放性的,临床医生和患者都知道每位患者接受的治疗。研究者对新治疗的自然热情可能会影响他对患者病情进展的判断,也可能传递给患者,影响他们的健康状况,尤其是在症状主观的疾病中,如疼痛程度。许多早期这类研究曾显示新治疗效果显著,但经过更仔细的检验后,这种表面上的益处往往消失。有些情况下,早期结果甚至导致治疗被采用,而没有进行我们现在认为足够的调查。有多个例子表明,某些治疗在多年临床应用后被发现无效。其中一个例子是胃冷冻治疗十二指肠溃疡,这种方法在七年内被发现、采用又废弃(Miao,1977)。一个特别显著的例子是婴儿视网膜后纤维增生症流行导致失明的故事。20世纪50年代,极早产婴儿被给予高剂量氧气治疗。然而,用促肾上腺皮质激素治疗出现早期眼部变化的婴儿的成功率为75%。氧气和激素治疗都在没有对照试验的情况下被采用。经过数年临床使用后,迟来的临床试验发现激素治疗无效(75%的此类婴儿无需治疗即可恢复正常),而氧气治疗则明显有害;它实际上是导致失明的原因(Silverman,1985)。

The introduction of a new treatment is a long and complex affair, and many apparently promising therapies fall by the wayside. It is natural to begin investigation by trying a new treatment on some patients to see what happens. This type of study is uncontrolled, so that any benefits or harmful effects seen in the patients will naturally be ascribed solely to the treatment. Such studies are usually open, where the clinician and the patients know what treatment each patient is getting. The investigator's natural enthusiasm for the new treatment may well influence his judgement of the patients' progress, and may also be transmitted to the patients and affect their well-being, especially for conditions where symptoms are subjective, such as degree of pain. Many early studies of this type have suggested that new treatments were highly effective, only for this apparent benefit to disappear on more careful examination. In some cases early results may lead to a treatment being adopted without what we would now consider to be adequate investigation. There are several instances of treatments being investigated after many years' clinical use and being found ineffective. One such was gastric freezing as a treatment for duodenal ulcer, which was discovered, adopted and abandoned within the space of seven years (Miao, 1977). A particularly marked example is the story of the epidemic in babies of retrolental fibroplasia leading to blindness. In the 1950s high doses of oxygen were given to very premature babies. However, the treatment of infants with early eye changes with adrenocorticotrophic hormone had a 75% success rate. Both the oxygen and hormone treatments had been adopted without the benefit of controlled trials. Only after several years of clinical use was it found, after clinical trials were belatedly carried out, that the hormone treatment was ineffective - 75% of such infants return to normal without treatment - and that the oxygen treatment was positively harmful; it caused the blindness in the first place (Silverman, 1985).

非对照实验有其适用场合,上文称之为第二阶段试验,但它们往往给出过于乐观且因此存在偏倚的结果。对新治疗的最终评估应基于与替代治疗效果的比较。
There is a place for uncontrolled experiments, designated above as Phase II trials, but they tend to give over-optimistic, and hence biased, results. Definitive assessment of a new treatment should be in relation to the effectiveness of an alternative treatment.

正如我们将看到的,如果两种治疗同时进行研究,治疗分配采用随机过程,且患者和临床医生均不知道所接受的治疗,这将带来重大优势。随机双盲对照试验通常被视为评价试验设计质量的“金标准”。
As we will see, there are major advantages if the two treatments are investigated concurrently, allocation of treatments to patients is by a random process, and neither the patient nor the clinician knows which treatment was received. The randomized double-blind controlled trial is usually taken as the 'gold standard' against which to judge the quality of the design of a trial.

15.2.2 Random allocation

A vital issue in design is to ensure that the allocation of treatments to patients is independent of the characteristics of the patients - in other words, it is carried out in an unbiased way. The most widely used method of unbiased treatment allocation is to use random allocation to determine which treatment each patient gets. As we saw in Chapter 5, random allocation gives all subjects the same chance of receiving either treatment and is thus unbiased by definition. Another important reason for using random sampling is that statistical methods of analysis are based on what we expect to happen in random samples from populations with specified characteristics.

It is highly desirable that, as far as is possible, the groups of patients receiving the different treatments are very similar with regard to features that may affect how well they do, that is in their prognosis. For example, in most studies it is important that the age distribution of the groups is similar, because prognosis is very often related to age. There is no guarantee, however, that randomization will in fact lead to the groups being very similar. Any differences between the groups will have arisen by chance, but such differences can be at least inconvenient, and may lead to doubts being cast on the interpretation of the trial results. While it is possible to modify the analysis to take account of any differences between the groups at the start (see section 15.4), it is far better to try to control the problem at the design stage. Most obviously this can be done by using stratified randomization, as described in section 5.7.3. If we know in advance that there are a few key variables that are strongly prognostic then they can be incorporated into a stratified randomization scheme. As observed in section 5.7.3, it is essential that stratified randomization uses blocking, otherwise there is no benefit over simple randomization. There may well be other important variables that we cannot measure or have not identified, and we must rely on the randomization to balance them out. The benefits of having a stratified design are not widely accepted (Peto et al., 1976; Meier, 1981), especially as the increased complexity gives more scope for errors in execution.

A different method of obtaining well-matched groups is to use the technique of minimization described in the next section.
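A minimal sketch of how stratified randomization with blocking might be set up is given below. It is illustrative only: the function names, the block size of 4, the stratum labels and the seed are assumptions made for this example, not details taken from the text or from section 5.7.3. A separate list of randomly permuted blocks is prepared for each stratum, so that within every stratum the two treatment groups can never differ in size by more than half a block.

```python
import random


def blocked_allocation(n_patients, treatments=("A", "B"), block_size=4, rng=None):
    """Return a treatment sequence built from randomly permuted blocks."""
    rng = rng or random.Random()
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    per_block = block_size // len(treatments)
    sequence = []
    while len(sequence) < n_patients:
        # Each block holds every treatment equally often, in random order.
        block = list(treatments) * per_block
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]


def stratified_lists(strata, n_per_stratum, seed=None):
    """Prepare one independent blocked allocation list per stratum."""
    rng = random.Random(seed)
    return {s: blocked_allocation(n_per_stratum, rng=rng) for s in strata}


# Hypothetical strata defined by a single prognostic variable (age).
lists = stratified_lists(["age <= 50", "age > 50"], n_per_stratum=12, seed=1)
for stratum, sequence in lists.items():
    print(stratum, "".join(sequence))
```

Each new patient takes the next unused entry from the list for his or her stratum. Without the blocking step each stratum would simply be a short simple-randomized sequence, which is why stratification alone gives no benefit.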

15.2.3 Minimization

The desirability of random allocation in comparative studies was stressed in the previous section. The use of non-random controls in clinical trials severely lessens the credibility of the results.

Minimization is one non-random method, however, that can be used safely. Indeed, it has definite advantages over both simple and stratified randomization, unless the sample size is large. The use of minimization will provide treatment groups very closely similar for several variables, even in small samples. It is especially suitable for smaller trials and for trials where small numbers of patients are recruited from each of several centres.

Table 15.1 Some baseline characteristics of patients in a controlled trial of mustine versus talc in the control of pleural effusions in patients with breast cancer (Fentiman et al., 1983)

| | Mustine (n = 23) | Talc (n = 23) |
|---|---|---|
| Mean age (SE) | 50.3 (1.5) | 55.3 (2.2) |
| Stage of disease: 1 or 2 | 52% | 74% |
| Stage of disease: 3 or 4 | 48% | 26% |
| Mean interval in months between breast cancer diagnosis and effusion diagnosis (SE) | 33.1 (6.2) | 60.4 (13.1) |
| Postmenopausal | 43% | 74% |

Table 15.1 shows some characteristics of breast cancer patients randomized to receive either mustine or talc as a treatment for pleural effusions. Simple randomization was used in the small trial, and by chance the two treatment groups were noticeably different. Stratified randomization would have helped, but it is not feasible to stratify on several variables in such a small trial. With minimization the two groups would have been very similar with respect to all of these variables, and the results would have been more convincing.

Minimization is based on a completely different principle from randomization. If we regard the patients for the trial as arriving one at a time, then the first patient is given a treatment at random. For each subsequent patient we determine which treatment would lead to better balance between the groups with respect to the variables of interest. The patient is then randomized using a weighting (see section 5.7.1) in favour of the treatment which would minimize the imbalance. For example, we might use a weighting of 4 to 1, so that there is an 80% chance of each patient getting the treatment that minimizes the imbalance. The effect of this procedure is that the groups will be much more similar with regard to the chosen variables than they would be with simple randomization.

Suppose that the mustine vs talc trial had used minimization based on the four variables shown in Table 15.1. For each variable we can divide the possible values into two groups, as follows:

| Variable | Categories |
|---|---|
| Age (years) | ≤50 or >50 |
| Stage of disease | 1 or 2, or 3 or 4 |
| Time between diagnosis of cancer and diagnosis of effusions (months) | ≤30 or >30 |
| Menopausal status | Pre or Post |

Suppose that after 29 patients had entered this trial the numbers in each subgroup in each treatment group were as shown in Table 15.2. We now wish to enter into the trial a patient with the following characteristics: 57 years old; stage 3; time interval 22 months; postmenopausal. The numbers of women with this patient's characteristics already in the two treatment groups are shown in Table 15.3. As we wish to have the two groups as similar as possible, the preferable treatment for the new patient is that with the smaller total. Here we would use weighted randomization with a weighting in favour of talc.

After the patient is allocated to a treatment the numbers in each group are updated and the process is repeated for the next patient. If for any patient the totals for the two treatments are the same, then the choice should be made using simple (unweighted) randomization, as it is for the first patient. The method extends simply to variables with more than two categories and to trials of more than two treatments.

Table 15.2 Characteristics of the first 29 patients in a clinical trial using minimization to allocate treatments

| | | Mustine (n = 15) | Talc (n = 14) |
|---|---|---|---|
| Age | ≤50 | 7 | 6 |
| | >50 | 8 | 8 |
| Stage | 1 or 2 | 11 | 11 |
| | 3 or 4 | 4 | 3 |
| Time interval | ≤30 m | 6 | 4 |
| | >30 m | 9 | 10 |
| Menopausal status | Pre | 7 | 5 |
| | Post | 8 | 9 |

Table 15.3 Calculation of imbalance in patient characteristics for allocating treatment to the thirtieth patient

| | | Mustine (n = 15) | Talc (n = 14) |
|---|---|---|---|
| Age | >50 | 8 | 8 |
| Stage | 3 or 4 | 4 | 3 |
| Time interval | ≤30 m | 6 | 4 |
| Postmenopausal | | 8 | 9 |
| Total | | 26 | 24 |
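The calculation in Tables 15.2 and 15.3 can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a reference implementation: the names `counts`, `imbalance_totals` and `allocate`, the category labels and the seed are invented for the example, and it handles only two treatments. The counts are those of Table 15.2, and the weighting of 4 to 1 corresponds to `weight=0.8`.

```python
import random

# Patients already allocated, by treatment and characteristic (Table 15.2).
counts = {
    "Mustine": {
        "age": {"<=50": 7, ">50": 8},
        "stage": {"1-2": 11, "3-4": 4},
        "interval": {"<=30": 6, ">30": 9},
        "menopause": {"pre": 7, "post": 8},
    },
    "Talc": {
        "age": {"<=50": 6, ">50": 8},
        "stage": {"1-2": 11, "3-4": 3},
        "interval": {"<=30": 4, ">30": 10},
        "menopause": {"pre": 5, "post": 9},
    },
}


def imbalance_totals(counts, patient):
    """Sum, per treatment, the existing patients sharing each characteristic."""
    return {
        treatment: sum(table[factor][level] for factor, level in patient.items())
        for treatment, table in counts.items()
    }


def allocate(counts, patient, weight=0.8, rng=random):
    """Allocate one patient between two treatments and update the counts."""
    totals = imbalance_totals(counts, patient)
    first, second = totals  # the two treatment names
    if totals[first] == totals[second]:
        choice = rng.choice([first, second])  # tie: simple randomization
    else:
        preferred = min(totals, key=totals.get)
        other = second if preferred == first else first
        # Weighted randomization in favour of the treatment with the smaller total.
        choice = preferred if rng.random() < weight else other
    for factor, level in patient.items():
        counts[choice][factor][level] += 1
    return choice


# Thirtieth patient: 57 years old, stage 3, interval 22 months, postmenopausal.
patient = {"age": ">50", "stage": "3-4", "interval": "<=30", "menopause": "post"}
totals = imbalance_totals(counts, patient)
print(totals)  # {'Mustine': 26, 'Talc': 24}
choice = allocate(counts, patient, rng=random.Random(30))
print(choice)
```

The totals reproduce the last row of Table 15.3, so talc is the preferred treatment for this patient. Setting `weight=1.0` would give the deterministic version of the method, in which the random component is omitted.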

The random component can be omitted from the allocation of treatments, so that each patient is automatically given the treatment which leads to less imbalance. Although the treatment that a particular patient receives depends in a complicated way upon the characteristics of the patients already entered into the trial, the absence of a random element introduces a small possibility of selection bias. It is preferable therefore to use weighted randomization.

Minimization is a valid alternative to ordinary randomization, and it has the important advantage, especially in small trials, that there will be only minor differences between the groups with respect to those variables used in the allocation process. It is particularly suitable to be performed with the aid of a computer program, but it is not difficult to perform 'by hand' if the record of the numbers of patients with each characteristic in each group is updated after each new patient has entered the trial.

15.2.4 Other methods of treatment allocation

Alternatives to random allocation may be divided into systematic (or pseudo-random) methods and non-random methods. Non-randomized trials can be further divided into those with concurrent or non-concurrent (or historical) controls.

(a) Systematic allocation

A common approach is to allocate treatments to patients according to the patient's date of birth or date of enrolment in the trial (such as giving treatment A to those with even dates, and treatment B to those with odd dates), by the terminal digit of the hospital number, or simply alternately into the different treatment groups. While all of these approaches are in principle unbiased, problems arise from the openness of the allocation system. Put crudely, it is a well-known phenomenon for the allocation to be altered by someone with access to the procedure. Further, knowledge of which treatment a patient is destined to receive can affect the decision about whether to enter that patient into the trial. While such actions are often taken for altruistic motives, the result is a biased allocation and quite possibly a worthless set of data.

Although systematic allocation appears unbiased, it is open to abuse and cannot be recommended unless there really is no alternative. The term 'pseudo-random' is misleading, as there is no random element and the method is definitely inferior to true random allocation.

(b) Non-random concurrent controls

The use of non-random controls leads to problems of interpretation, because it will usually be impossible to establish that the groups are comparable. Indeed, the groups may specifically differ in known ways but with unknown effect. For example, in the trial of vitamin supplementation versus placebo in relation to neural tube defects (Smithells et al., 1980), discussed further below, the control group included women ineligible for the trial as well as women who refused to participate. Many studies have shown that there is a volunteer bias, with volunteers usually having a better prognosis than refusers. We should worry about bias whenever there is a systematic difference between the patients given different treatments, for example when the groups are taken from patients at different hospitals. Studies where the treatments are given as deemed appropriate by the clinician are especially unreliable.

(c) Historical controls

Probably the simplest approach to evaluating a new treatment is to compare a single group of patients all given the new treatment with a group previously treated with an alternative treatment. Often these will be two consecutive series of patients in the same hospital(s). Despite a few advocates, this approach is seriously flawed as we can never satisfactorily eliminate possible biases due to other factors that may have changed over time. Pocock (1977) showed that in 19 cases where the same therapy was used in two consecutive trials of cancer chemotherapy in the same institution there were large changes in the observed death rates. While some of the variation was probably due to small sample sizes, four of the differences were statistically significant at the 2% level. Sacks et al. (1983) compared trials of the same therapies in which randomized or historical controls were used, and found a consistent tendency for historically controlled trials to yield more optimistic results than randomized trials. The use of historical controls can only be justified in tightly controlled situations of relatively rare conditions, such as in evaluating therapies for advanced cancer.

The balance of opinion has now swung so far towards randomized trials that the results of non-randomized trials may cause major controversy. A recent example was the study of the possible benefit of vitamin supplementation at the time of conception in women at high risk of having a baby with a neural tube defect (NTD) (Smithells et al., 1980). They found that the vitamin group subsequently had fewer NTD babies than the placebo control group, but because the study was not randomized the findings are not widely accepted, and the Medical Research Council is now running a large randomized trial to try to get a proper answer to the question.

15.2.5 Alternative designs

The simplest design for a clinical trial is called the parallel group design, in which two different groups of patients are studied concurrently. This is the design that has been implicit in this chapter so far. The most common alternative is the crossover design, which is described below together with some other less common designs that are worth knowing about.

(a) Crossover design

A crossover trial is one in which the same group of patients are given both (or all) treatments of interest in sequence. Here randomization is used to determine the order in which the treatments are received. The crossover design has some attractive features, in particular that the treatment comparison is 'within-subject' rather than 'between-subject', and that the sample size needed is smaller. There are some important disadvantages, however, which I shall describe in relation to a two-period crossover trial:

  1. Patients may drop out after the first treatment, and so not receive the second treatment. Withdrawal may be related to side-effects. Treatment periods should be fairly short to minimize the risk of drop-out for other reasons.

  2. There may be a carry-over of treatment effect from one period to the next, so that the results obtained during the second treatment are affected by what happened in the first period. In other words, the observed difference between the treatments will depend upon the order in which they were received. In the presence of such a treatment-period interaction the data for the second period may have to be discarded, severely weakening the power of the trial.

  3. There may be some systematic difference between the two periods of the trial. For example, the observations in the second period may be somewhat lower than those in the first period, regardless of treatment. A small period effect is not too serious, as it applies equally to both treatments.

  4. Crossover studies cannot be used for conditions which can be cured, and are most suitable when the effect of the treatment can be assessed quickly.
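The different consequences of a period effect and a carry-over effect can be seen in a simple additive model. The notation is an assumption introduced here for illustration, not taken from the text: τ_A and τ_B are the treatment effects, π_1 and π_2 the period effects, and λ_A and λ_B the carry-over of each treatment into the second period. Writing d for the expected within-subject difference (period 1 minus period 2) in each of the two sequence groups:

```latex
\begin{align*}
d_{AB} &= (\tau_A + \pi_1) - (\tau_B + \pi_2 + \lambda_A), \\
d_{BA} &= (\tau_B + \pi_1) - (\tau_A + \pi_2 + \lambda_B), \\
\tfrac{1}{2}\,(d_{AB} - d_{BA}) &= (\tau_A - \tau_B) - \tfrac{1}{2}\,(\lambda_A - \lambda_B).
\end{align*}
```

The period terms cancel, which is why a period effect applying equally to both treatments does no harm when the order of treatments is randomized. The carry-over terms cancel only if λ_A = λ_B; otherwise the estimated treatment difference is biased, and that is the situation in which the second-period data may have to be discarded.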

It is desirable to establish in advance that there will not be any carry-over treatment effect, but the information may be unavailable. A wash-out period is sometimes introduced between the treatment periods to try to eliminate carry-over effects. Because of the problems described, crossover studies are probably overused. Further discussion is given by Woods et al. (1989).

The analysis of crossover trials is explained and illustrated in section 15.4.10.

(b) Within group (paired) comparisons

Another type of within group design is when alternative treatments are investigated in the same subjects at the same time. It can be used for treatments that can be given independently to matching parts of the anatomy, such as limbs or eyes. The matched design has all the advantages of the crossover design, but none of the disadvantages, so is a very powerful design. Unfortunately, there are few circumstances in which it can be used.

The nearest equivalent to the paired within subject design is the matched pairs design, where pairs of subjects are matched for, say, age, sex and certain prognostic factors, and the two treatments are then allocated to the pair of subjects at random. This design can only be used easily when there is a pool of subjects that can be entered into the trial, in order to be able to find matched pairs. Where there are known important prognostic variables the design removes much of the between subject variation, and ensures that the subjects receiving each treatment have very similar characteristics.

(c) Sequential designs

Another type of design is the sequential trial, in which parallel groups are studied, but the trial continues until a clear benefit of one treatment is seen or it is unlikely that any difference will emerge. The main advantage of sequential trials is that they will be shorter than fixed length trials when there is a large difference in the effectiveness of the two treatments.

In sequential trials the data are analysed after each patient's results become available. Their use is therefore restricted to conditions where the outcome is known relatively quickly. There are problems with blinding (see section 15.2.6), and possibly also ethical difficulties.

A useful variation on this principle is the group sequential trial, in which the data are analysed after each block of patients has been seen, perhaps four or five times in all. This allows the trial to be planned more easily (regarding length) but also enables the trial to be stopped early if a clear treatment difference is seen.

In the right circumstances sequential trials are a good method, and they should be used more frequently.

(d) Factorial designs

One further type of design is called the factorial design, in which two treatments, say A and B, are simultaneously compared with each other and with a control. Patients are divided into four groups, who receive the control treatment, A only, B only, and both A and B. This design allows the investigation of the interaction (or 'synergy') between A and B. The factorial design is rarely used in clinical trials, but Pocock (1983, p. 139) describes some examples of its use.

(e) Adaptive designs

Ethical considerations have led some people to advocate adaptive designs, in which the proportion of subjects getting the inferior treatment diminishes as the trial proceeds. In other words, a patient's treatment depends to some extent on the outcome of treatment in previous patients in the trial. Apart from practical difficulties, such as needing to know quickly the results from each patient, it is questionable whether this design resolves any ethical problems. Adaptive designs have rarely been used.

(f) Zelen's design

Lastly, Zelen (1979) proposed a variation on the randomized trial that seems to avoid problems associated with getting informed consent. Half of the subjects are allocated at random to receive the standard treatment, and are treated as if they were not in a trial. The other half are offered the new experimental treatment, but they can choose to have the standard treatment if they wish. An essential feature of Zelen's proposal (Zelen, 1979) is that the two groups are analysed as originally randomized, regardless of which treatment those in the second group actually opted for. While this design has some useful features, it can only be of value when a high proportion of those offered the new treatment take it, which cannot be known in advance. This design has rarely been used, and many consider it unethical not to tell half the patients that they are in a trial. A variation is where both groups are told which treatment they have been allocated and are offered the chance to switch to the other. While resolving the ethical difficulty, there are possible difficulties associated with the necessary lack of blindness and loss of power if too many patients opt to change treatment. There does not seem to be much to recommend this design. Ellenberg (1984) gives further discussion.
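A rough calculation shows why the design depends so heavily on uptake. The following is a sketch under simplifying assumptions that are not stated in the text: the outcome is a mean with the same variance in every group, a proportion p of those offered the new treatment accept it, those who decline respond exactly as the standard group does, and δ denotes the true benefit of the new treatment.

```latex
\begin{align*}
\text{mean in standard group} &= \mu, \\
\text{mean in group offered new treatment} &= p\,(\mu + \delta) + (1 - p)\,\mu = \mu + p\,\delta, \\
\text{difference as randomized} &= p\,\delta, \\
\frac{n_{\text{Zelen}}}{n_{\text{conventional}}} &\approx \frac{1}{p^{2}}.
\end{align*}
```

The last line follows because the sample size needed for a given power is inversely proportional to the square of the difference to be detected. Under these assumptions, if 80% of patients accept the new treatment the trial needs about 1/0.64 ≈ 1.6 times as many patients as a conventional randomized trial, and if only 50% accept it needs about four times as many.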

15.2.6 Blindness

The key to a successful clinical trial is to avoid any biases in the comparison of the groups. Randomization deals with possible bias at the treatment allocation, but bias can also creep in while the study is being run. Both the patient and the doctor may be affected in the way they respectively respond and observe by knowledge of which treatment was given. For this reason, it is desirable that neither the patient nor the person evaluating the patient knows which treatment was given. Such a trial is called double-blind. If only the patient is unaware, as is sometimes the case, the trial is called single-blind. In several fields, such as surgery, it is often impossible for a study to be double-blind. Clinical trials should use the maximum degree of blindness that is possible.

In addition, the treatment allocation system should be set up so that the person entering patients does not know in advance which treatment the next person will get. A common way of doing this is to use a series of consecutively numbered sealed opaque envelopes, each containing a treatment specification. For stratified randomization, two or more sets of envelopes are needed. For drug trials the allocation may be carried out by the pharmacy, who will produce numbered bottles which do not indicate the treatment contained.

Double-blind trials clearly require that the different treatments should be indistinguishable to the patient and to whoever assesses the patient. For drug trials comparing two active treatments this may require the double dummy technique, in which each patient receives one of the active drugs and a dummy tablet that looks like the alternative active drug.

15.2.7 Placebos

When we wish to evaluate a new treatment for a condition there is the problem of what treatment to give to the control group. If (and only if) there is no existing standard beneficial treatment, then it is reasonable not to give the control group any active treatment. However, there are two reasons why it is desirable to give the control group patients an inert dummy or placebo treatment, rather than nothing. Firstly, the act of taking some treatment may itself have some benefit to the patient, so that if we give nothing at all to the control group then part of any benefit observed in the treated group could be due to the knowledge or belief that they had taken a treatment. This is known as the placebo effect. Secondly, in order for a study to be double-blind it is necessary for the two treatments to be indistinguishable. Placebo tablets should therefore be identical in appearance and taste to the active treatment, but pharmacologically inactive.

Many clinical trials do, in fact, find some apparent benefit of treatment in the placebo group, and there are often side-effects too. Without a comparison group, who may be given an alternative active treatment or a placebo, we cannot know how specific any benefit (or harm) is to the new treatment being investigated. For example, if there are as many reported headaches in the placebo group as in the active treatment group, we would not consider headache as a side-effect of the active treatment.

Placebos can sometimes be used in non-drug trials too. In section 10.3 I described a trial that had used mock electrical stimulation as a control treatment. Likewise a control for acupuncture is easily set up by having needles inserted at the 'wrong' points. There may however be ethical problems associated with invasive placebos.

15.2.8 Selection of subjects

Clinical trials are a prime example of the principle that we collect data from a sample and use the results of the analysis to make inferences about the population of all such subjects. In order for this process to work it is clearly necessary to select a representative sample. In practice, however, many restrictions are usually placed on who is eligible to take part in a trial, and so extrapolation of the results to the population may be difficult. For example, a placebo-controlled trial was carried out in British doctors to see if daily aspirin would reduce the incidence of and mortality from stroke, myocardial infarction and other vascular conditions (Peto et al., 1988). The investigators identified 20,000 doctors who were willing to participate but almost three-quarters of them were ineligible, either because they were already taking aspirin for some reason, because there were reasons why they could not take aspirin, or because they had a history of peptic ulcer, stroke or myocardial infarction. The doctors who took part in the study were therefore a selected group of more healthy individuals.

Figure 15.1 shows how patients with either low or high blood pressure would be excluded from a trial of a new antihypertensive treatment, although for different reasons. Often the patients whom it is both ethical and reasonable to include in a trial are those most likely to benefit if the treatment is effective. In general, but not always, we do not expect treatment to do much for patients who already have an excellent prognosis, nor for those with a dreadful prognosis.


Figure 15.1 Diagram showing the eligibility of patients for a trial of a new antihypertensive agent (based on Elwood, 1982).

Begg and Engstrom (1987) discussed the problem of over-restrictive eligibility criteria in cancer clinical trials, which can lead to most patients with a disease being ineligible for a trial. They suggest that many exclusion criteria are unnecessary. The more restrictive the exclusion criteria, the less generalizable will be the results of the trial. In large trials especially it is better not to be too restrictive, although in small trials there may be some advantage in keeping the study subjects more homogeneous, especially if simple randomization is used.

15.2.9 Ethical issues

A clinical trial is an experiment on human beings, so it is not surprising that there are several important ethical issues relating to clinical trials. One concerns the amount of information given to the patient. In general the patient should be invited to be in the trial, and should be told what the alternative treatments are (although they will usually not know which they will get). They can decline to be in the trial, in which case they will be treated normally. If they agree to participate they will often have to sign a form stating that they understand the trial. This informed consent is controversial, because it is likely that many patients do not really understand what they are told, and that they are not always told as much as they should be. There are some cases where it is not possible to get informed consent, for example when the patients are very young, very old, or unconscious. Also there are a few circumstances where it might be difficult to get people to agree to be randomized, such as in a trial comparing mastectomy with chemotherapy as a treatment for breast cancer.

从临床角度看,如果医生认为正在研究的某种治疗优于其他治疗,则不应参与该临床试验,也不应让认为某种治疗更适合的患者参加试验。换句话说,理想的医学状态是无知状态:进行试验是因为我们不知道哪种治疗更好。有人可能认为,活性治疗在无对照的观察性研究中显示出有希望的结果,因此肯定优于安慰剂,但事实并非总是如此。此外,即使治疗有效,也可能存在不可接受的副作用。
From the clinical side, no doctor should participate in a clinical trial if they believe that one of the treatments being investigated is superior, and they should not enter any patient for whom they think that a particular treatment is indicated. In other words, the ideal medical state to be in is one of ignorance: the trial is carried out because we do not know which treatment is better. It may be thought that an active treatment, which will have yielded promising results in uncontrolled observational studies, would be certain to be better than a placebo, but this is not always so. Further, even if the treatment is beneficial there may be unacceptable side-effects.

许多国家设有大量伦理委员会,负责审查临床试验(以及任何涉及人体的研究)提案。有趣的是,围绕受孕期维生素补充剂试验设计的问题(Smithells 等,1980)主要源于伦理委员会拒绝批准最初提出的随机试验方案。当然,最终实施的研究得到了伦理委员会的批准。伦理委员会通常只关注患者福利,不涉及科学问题,包括统计学问题。
In many countries there are a large number of ethics committees set up to consider proposals to carry out clinical trials (and, indeed, any research involving human subjects). Interestingly, the problems relating to the design of the trial of vitamin supplementation around conception (Smithells et al., 1980) stemmed largely from the refusal of ethics committees to sanction the randomized trial that was originally proposed. The study as performed did, of course, have the approval of the ethics committees. Ethics committees are usually concerned only with the welfare of the patient, and do not consider scientific, including statistical, issues.

在设计方面,可以认为非随机试验,尤其是使用非同时(历史)对照的试验是不道德的,因为如前所述,这类试验结果极不可靠。类似的批评也适用于任何采用次优方法的试验,尽管无法明确划定伦理与非伦理研究的界限。
Regarding design, it can be argued that non-randomized trials, especially those with non-concurrent (historical) controls, are unethical because, as shown earlier, the results of such trials are so unreliable. Similar comments can be levelled at any trial which uses suboptimal methodology, although it is not possible to draw a precise line between ethical and unethical studies.

更广泛地说,任何使用低标准统计方法的研究(不一定是临床试验),尤其是在设计或分析方面,可能被视为不道德,原因有三(Altman,1982a):
More generally, any study (not necessarily a clinical trial) that uses substandard statistical methods, especially in design or analysis, may be deemed unethical for three reasons (Altman, 1982a):

  1. 滥用患者,使其承受无正当理由的风险和不便;
  2. 浪费资源,包括研究者的时间,这些时间本可用于更有价值的活动;
  3. 发表误导性结果的后果,可能导致进行不必要的后续研究。

  1. the misuse of patients by exposing them to unjustified risk and inconvenience;
  2. the misuse of resources, including the researchers' time, which could be better employed on more valuable activities; and
  3. the consequences of publishing misleading results, which may include the carrying out of unnecessary further work.

许多关于临床试验伦理问题的讨论见于 Bradford Hill(1963)。Silverman(1985,第153页)对主要问题进行了较新的综述。
Many of the ethical issues relating to clinical trials were dealt with by Bradford Hill (1963). Silverman (1985, p. 153) gives a more recent review of the main issues.

15.2.10 结局指标 15.2.10 Outcome measures

在大多数临床试验中,关于治疗效果的信息是针对许多变量收集的,有时是在多个时间点。人们容易陷入分析每个变量并寻找治疗组间显著差异的诱惑。这种方法会导致误导性结果,因为多重检验会使假设检验的结果失效。
In most clinical trials information about the effect of treatment is gathered in relation to many variables, sometimes on more than one occasion. There is the temptation to analyse each of the variables and look to see which differences between treatment groups are significant. This approach leads to misleading results, because multiple testing will invalidate the results of hypothesis tests.

特别是,仅展示最显著的结果,仿佛这些是唯一进行的分析,是欺诈行为。
In particular, presenting only the most significant results, as if these were the only analyses performed, is fraudulent.
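
The effect of multiple testing can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the tests are independent and that no true treatment effect exists for any variable; outcome variables in a real trial are usually correlated, so the figures are only a rough guide. The function name `familywise_error` is hypothetical.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one 'significant' result among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Each test is carried out at the conventional 5% level.
    for n_tests in (1, 5, 10, 20):
        chance = familywise_error(0.05, n_tests)
        print(f"{n_tests:2d} tests: P(at least one significant) = {chance:.3f}")
```

Under these assumptions the chance of at least one spuriously significant difference rises quickly with the number of variables examined, which is why picking out only the significant ones is misleading.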

更可取的方法是在分析前预先确定主要关注的结局指标,并在数据分析时重点关注该变量。其他数据也可以且应当进行分析,但这些变量应被视为次要。次要变量中的任何有趣发现应谨慎解读,更应作为进一步研究的思路,而非确定性结论。治疗的副作用也应以此方式处理。
A preferable approach is to decide in advance of the analysis which outcome measure is of major interest, and focus attention on this variable when analysing the data. Other data can and should be analysed too, but these variables should be considered to be of secondary importance. Any interesting findings among the secondary variables should be interpreted rather cautiously, more as ideas for further research than as definitive results. Side-effects of treatment should be treated in this way.

有时确实会有多个主要结局指标。如果有两个,分析这两个指标通常不会带来太大问题,或许可以采用更严格的统计显著性阈值。有时可以将两个变量合并为一个,尤其当关注的变量是替代事件时,如死亡或心脏病发作。
Sometimes there really will be more than one major outcome measure. If there are two, then no great harm will come from analysing them both, perhaps taking a stricter cut-off for statistical significance. Sometimes it is possible to combine two variables into one, in particular when the variables of interest are alternative events, such as death or heart attack.

最后,注意样本量计算(见第15.3节)是基于单一变量的。
Finally, note that sample size calculations (see section 15.3) are based on a single variable.

15.2.11 研究方案 15.2.11 Protocols

规划临床试验的重要方面是制定研究方案,研究方案是一份正式文件,概述了实施试验的拟定程序。
An important aspect of planning a clinical trial is to produce a protocol, which is a formal document outlining the proposed procedures for carrying out the trial.

Pocock(1983,第28-31页)建议研究方案应包括以下主要内容:
Pocock (1983, pp. 28-31) suggests the following main features of a study protocol:

  1. 背景和研究目标
  2. 具体目标
  3. 患者选择标准
  4. 治疗方案
  5. 患者评估方法
  6. 试验设计
  7. 患者登记与随机分组
  8. 患者知情同意
  9. 研究所需样本量
  10. 试验进展监控
  11. 表格与数据处理
  12. 方案偏差
  13. 统计分析计划
  14. 行政职责

  1. background and study objectives
  2. specific objectives
  3. patient selection criteria
  4. treatment schedules
  5. methods of patient evaluation
  6. trial design
  7. registration and randomization of patients
  8. patient consent
  9. required size of study
  10. monitoring of trial progress
  11. forms and data handling
  12. protocol deviations
  13. plans for statistical analysis
  14. administrative responsibilities.

申请试验经费时必须提交方案,大部分上述信息也需提供给当地伦理委员会。此外,方案不仅有助于试验的实施,还能使结果撰写更为简便,因为论文的引言和方法部分应基本与上述第1至9节内容一致。
A protocol is necessary when applying for a grant to carry out a trial, and most of the above information will be required by the local ethics committee. Further, as well as aiding in the carrying out of a trial, a protocol makes the writing up of the results much easier, as the introduction and methods section of the paper should be substantially the same as sections 1 to 9 above.

对于多中心研究,详细的方案是必不可少的,且强烈建议任何临床试验都应制定方案。事实上,我建议任何研究项目都应制定正式方案—上述大多数类别并非临床试验特有。
For multicentre studies a detailed protocol is essential, and it is strongly recommended for any clinical trial. Indeed, I recommend the drawing up of a proper protocol for any research project - most of the above categories are not specific to clinical trials.

15.3 样本量 15.3 SAMPLE SIZE

15.3.1 引言 15.3.1 Introduction

在第8.5.3节中,我介绍了与假设检验相关的检验效能(power)概念。检验效能是指在给定样本量的研究中,能够以统计学显著性检测到某一真实差异的概率。医学文献中存在许多样本量过小的试验,这些试验几乎无法有效检测治疗间临床有意义的差异。从多篇已发表试验的综述中可以明显看出,大多数试验在设计时未进行适当的样本量统计计算。除非真实治疗效应较大,否则小样本试验只有在样本中观察到的差异远大于真实差异时,才可能出现统计学显著结果。
In section 8.5.3 I introduced the concept of power in relation to hypothesis testing. The power of a test is the probability that a study of a given size would detect as statistically significant a real difference of a given magnitude. The medical literature contains many trials that were far too small to have a good chance of detecting clinically worthwhile differences between the treatments being investigated. It is clear from many reviews of published trials that the majority have been carried out with no statistical calculation of the appropriate sample size. Unless the true treatment effect is large, small trials can yield a statistically significant result only if, by chance, the observed difference in the sample is much larger than the real difference.

本节介绍用于计算比较两组独立受试者(平行组设计)或比较配对观察(配对或交叉设计)所需适当样本量的统计方法。这些方法不限于随机试验,而适用于一般的两组比较。尽管该方法存在一定的人工成分,但远优于常见的盲目试错方法。计算基于假设检验的原理。对于更复杂的试验(包括序贯试验)以及主要结局为生存时间的试验,样本量计算需要统计学支持。
This section introduces statistical methods for calculating the appropriate sample size for comparing two independent groups of subjects (parallel group design), or for comparing paired observations (paired or crossover design). These methods are not specific to randomized trials, but apply to two group comparisons in general. While there is some artificiality in the approach, it is vastly preferable to the hit and miss approach that is so common. The calculations are based on the principles of hypothesis testing. Sample size calculations for more complicated trials, including sequential trials, will require statistical assistance, as will those where the main outcome of interest is survival time.

15.3.2 样本量、假设检验与检验效能 15.3.2 Sample size, hypothesis tests and power

如果我们能够确定治疗之间最小的、具有临床价值的真实差异,就可以利用假设检验的检验力来计算临床试验的合适样本量。这个要求在某种程度上较为人为且难以界定。然而在实际中,通常可以明确新治疗相较于旧治疗需要达到的效益程度,才能被认为是值得采用的治疗。
We can use the power of a hypothesis test to calculate the appropriate sample size for a clinical trial if we can specify the smallest true difference between the treatments that would be clinically valuable. It is this requirement that is somewhat artificial and difficult to define. In practice, however, it is usually possible to specify the degree of benefit that the new treatment would need to have over the old one for it to be a worthwhile treatment.

样本量计算的主要思想是,如果存在有意义的效应,能够以较高概率被统计学显著地检测出来;反之,如果试验未发现此类效益,则可以较为确信其不存在。研究的检验力越大,我们的信心越强,但更高的检验力需要更大的样本量,正如我们将看到的。通常要求的检验力在80%至90%之间。实际上,我们试图使临床重要性和统计显著性相一致,从而减少解释上的困难。
The main idea behind the sample size calculations is to have a high chance of detecting, as statistically significant, a worthwhile effect if it exists, and thus to be reasonably sure that no such benefit exists if it is not found in the trial. The greater the power of the study, the more sure we can be, but greater power requires a larger sample, as we will see. It is common to require a power of between 80% and 90%. In effect, we try to make clinical importance and statistical significance agree, and thus reduce problems of interpretation.

所需的样本量通常通过复杂的公式计算,或者查阅大量的表格(Machin 和 Campbell,1987),但使用图解法更为简便。图15.2展示了一个列线图,可用于计算本章所讨论所有情形下的合适样本量。该方法使用简单,还有一个优点,即反向使用同样简便,可用于确定给定样本量的研究的检验力。
The necessary sample size is usually obtained from complicated formulae or there are extensive tables available (Machin and Campbell, 1987), but it is much simpler to use a graphical method. Figure 15.2 shows a nomogram that can be used to calculate the appropriate sample size for all the situations considered in this chapter. It is simple to use and has the added advantage of being equally easy to use in reverse for determining the power of a study of given sample size.

图15.2 用于计算样本量或检验力的列线图(经 Altman,1982b 授权转载)。
Figure 15.2 Nomogram for calculating sample size or power (reproduced from Altman, 1982b, with permission).

我将首先考虑两组样本量相等的情况。但正如稍后所示,该列线图同样适用于样本量不等的情况。所有样本量计算均基于所谓的标准化差异。对于连续变量或分类变量,计算方法不同,但原则上均基于感兴趣差异与观测值标准差的比值。换言之,我们将感兴趣的差异表示为标准差的倍数。正如预期的那样,该比值越小,所需的试验样本量越大。
I shall first consider the case where we intend to have two groups of equal size. The nomogram can be used, however, for unequal sample sizes, as I shall show later. All of the sample size calculations are based on the quantity known as the standardized difference. This is calculated in a different way for continuous or categorical outcome variables, but in principle it is based in each case on the ratio of the difference of interest to the standard deviation of the observations. In other words, we express the difference of interest as a multiple of the standard deviation. As we would expect, the smaller this ratio is the larger the required size of the trial.

(a) Continuous data - two independent groups

对于两组独立患者且结局为连续变量的研究,我们需要指定以下量:
For studies of two independent groups of patients with a continuous outcome measure we need to specify the following quantities:

  1. 变量的标准差(各组内),$s$;
  2. 临床相关差异,$\delta$;
  3. 显著性水平,$\alpha$;
  4. 检验效能,$1 - \beta$;

  1. standard deviation of the variable (in each group), $s$;
  2. clinically relevant difference, $\delta$;
  3. the significance level, $\alpha$;
  4. the power, $1 - \beta$;

并假设该变量在总体中服从正态分布。总样本量为 $N$。
and it is assumed that the variable has a Normal distribution in the population. The total sample size is $N$.

标准化差异简单地计算为感兴趣差异与标准差的比值,即 $\delta/s$,其中 $\delta$ 为感兴趣的差异,$s$ 为标准差。我们可以使用图15.2,根据任意期望的效能,从标准化差异计算所需样本量,显著性水平可选列线图上标出的任一水平。
The standardized difference is calculated simply as the ratio of the difference of interest to the standard deviation, that is $\delta/s$, where $\delta$ is the difference of interest and $s$ the standard deviation. We can use Figure 15.2 to calculate the necessary sample size from the standardized difference for any desired power, choosing one of the significance levels marked on the nomogram.

例如,假设我们计划在五岁儿童中进行一项牛奶喂养试验,观察每天补充牛奶一年是否能使身高增长超过对照组。(实际上,由于实际和伦理原因,这样的研究较难实施。)根据已发表的数据,该年龄段儿童平均年增长约 ,标准差为 。假设牛奶对身高增长的影响若达到至少 即视为重要。我们希望有较高的概率检测到此差异,因此设定效能为 ,显著性水平为 。标准化差异为 。现在我们可以使用图15.2计算所需样本量。我们从标准化差异刻度上的 点画一条直线到效能刻度上的 点,然后在对应 的线段上读取总样本量 ,结果为 ,即每组
For example, suppose that we are planning a milk- feeding trial in five- year- old children, to see if a daily supplement of milk for a year will lead to an increased gain in height compared with a control group. (Such a study would in fact be difficult to carry out, for practical and ethical reasons.) We know from published data that at this age children grow on average about in a year, with a standard deviation of . Suppose that the effect of the milk on height gain will be considered important if it is at least . We want a high probability of detecting such a difference, so we set the power to be and choose a significance level. The standardized difference is . We can now use Figure 15.2 to calculate the necessary sample size. We 'draw' a straight line from the value on the scale for the standardized difference to the value on the scale for power and read off the value for on the line corresponding to , which gives a total sample size of , i.e. in each group.

如果没有标准差的估计值,有几种可能的方法。一种方法是开始试验并使用首批患者的数据来估计标准差,从而确定所需的样本量。或者,可以将问题重新定义为某一选定临界值上下比例的差异,然后使用下文描述的比例方法。例如,我们可以将一项抗高血压药物的试验重新表述为收缩压降至低于某一指定临界值的受试者比例差异,而不是比较平均血压。另一种可能是直接以未知标准差来指定感兴趣的差异。例如,Guyatt 等人(1987)设计了一项比较氨溴索和安慰剂在慢性支气管炎患者中的试验,他们使用问卷得出症状严重度评分。由于不知道这些评分的标准差,他们指定希望能够检测出组间差异为一个标准差。因此,标准化差异为1.0,研究者避免了必须指定标准差的需求。所有这些解决方案都涉及一定程度的主观判断。
There are several possible approaches if no estimate of the standard deviation is available. One way is to start the trial and use the data for the first patients to estimate the standard deviation and thus the sample size needed. Alternatively, the problem can be redefined in terms of the difference between the proportions above and below some chosen cut-off level, and the methods for proportions described below can then be used. For example, we may recast a trial of an antihypertensive agent in terms of the difference in the proportion of subjects whose systolic blood pressure is reduced to below a specified cut-off value, rather than a comparison of mean blood pressure. Another possibility is to specify the difference of interest directly in terms of the unknown standard deviation. For example, Guyatt et al. (1987) set up a trial to compare ambroxol and placebo in patients with chronic bronchitis, in which they used a questionnaire to derive a score for severity of symptoms. They did not know the standard deviation of these scores, so specified that they wished to be able to detect a difference between the groups of one standard deviation. The standardized difference was therefore 1.0, and the researchers had avoided the need to specify the standard deviation. All of these solutions involve some degree of subjectivity.

计算任意四个输入值组合的样本量非常容易,我们也可以通过改变输入值来调整样本量。然而,最好事先确定需求。虽然适度放宽这些要求是可以接受的,但一般来说,如果计算出的样本量超过实际可行范围,则研究可以通过延长时间或增加研究中心来扩大规模。如果无法接近所需的研究规模,最好放弃该研究。
It is easy to calculate the sample size for any combination of the four input values, and we can always change the sample size by altering the input values. However, it is preferable to decide in advance what the requirements are. While some modest relaxation of these is acceptable, in general if the calculated sample size exceeds what seems practical, then the study can be extended either in time or by running the study at more centres. If it is not possible to get near to the required size of study, then the study may best be abandoned.
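
For readers without the nomogram to hand, the calculation can be sketched numerically. The code below is not the nomogram itself: it assumes the usual Normal-approximation relation between total sample size, two-sided significance level, power and standardized difference, so values read from Figure 15.2 may differ slightly. The function name and the input values (a difference of 5 units with a standard deviation of 10) are hypothetical and are not taken from the milk-feeding example.

```python
from statistics import NormalDist


def total_sample_size(std_diff: float, power: float = 0.90, alpha: float = 0.05) -> float:
    """Approximate total sample size (both groups combined, equal allocation)
    for comparing two independent means with a two-sided test.

    Assumed formula: N = 4 * (z_{1 - alpha/2} + z_{power})^2 / std_diff^2
    """
    z = NormalDist().inv_cdf            # standard Normal quantile function
    z_alpha = z(1.0 - alpha / 2.0)      # two-sided significance level
    z_power = z(power)
    return 4.0 * (z_alpha + z_power) ** 2 / std_diff ** 2


if __name__ == "__main__":
    delta, sd = 5.0, 10.0               # hypothetical difference and standard deviation
    std_diff = delta / sd               # standardized difference = 0.5
    n_total = total_sample_size(std_diff, power=0.90, alpha=0.05)
    print(f"standardized difference = {std_diff:.2f}, total N is about {n_total:.0f}")
```

Halving the standardized difference multiplies the required sample size by four, which shows why relaxing the clinically relevant difference has such a large effect on the size of a trial.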

(b) Continuous data - paired or within person studies

配对研究或个体内研究(如交叉试验)的适当样本量计算方法非常相似。主要区别在于我们使用的标准差是预期变化的标准差,我将其称为 $s_d$。不幸的是,这个标准差的估计值往往不可得。如果我们有一个合理的 $s_d$ 估计值,就可以计算标准化差异为 $2\delta/s_d$(其中 $\delta$ 为临床相关差异),然后像之前一样使用列线图。(注意,这与独立组的公式类似,唯一区别是乘以了2。)
The appropriate sample size for paired studies, or within person studies such as crossover trials, is obtained in a very similar way. The main difference is that the standard deviation we use is the standard deviation of the changes expected, which I shall call $s_d$. Unfortunately, an estimate of this standard deviation is often not available. If we do have a reasonable estimate of $s_d$, we can calculate the standardized difference as $2\delta/s_d$, where $\delta$ is the clinically relevant difference, and then use the nomogram as before. (Note the similarity to the formula for independent groups, apart from the multiplier of 2.)

(c) Categorical data

图15.2中的列线图也可用于二分类结局变量的研究。如果结局变量有多于两个类别,则需要创建一个感兴趣的二分类变量。例如,如果患者被评估为“改善”、“无变化”或“恶化”,那么样本量计算可以基于患者是否改善。
The nomogram in Figure 15.2 can also be used for studies which have a binary outcome variable. If the outcome variable has more than two categories it is necessary to create a binary variable of interest. For example, if patients are to be assessed as 'improved', 'no change' or 'worse', then the sample size calculation could be based on whether or not the patient has improved.

比较比例的样本量计算利用了二项分布的正态近似,详见第8.4.3节。它基于以下信息:
The calculation of sample size for comparing proportions makes use of the Normal approximation to the Binomial distribution, discussed in section 8.4.3. It is based on the following information:

  1. 每组中预期具有特定结局的比例($p_1$ 和 $p_2$);
  2. 显著性水平($\alpha$,双侧);
  3. 检验效能($1 - \beta$)。

  1. the expected proportion with the specified outcome in each group ($p_1$ and $p_2$);
  2. the significance level ($\alpha$, two-sided);
  3. the power ($1 - \beta$).

通常指定 $p_1$ 和 $p_2$ 的思路是,已有知识应能预测对照组中结局的比例(如 $p_1$),因此我们需要指定实验组中代表重要改善的结局比例($p_2$)。
The usual way of thinking about specifying $p_1$ and $p_2$ is that previous knowledge should allow us to predict the proportion with the outcome in the control group (say $p_1$), and so we need to specify the proportion with the outcome in the experimental group ($p_2$) that would represent an important improvement.

给定 $p_1$ 和 $p_2$ 的具体值,我们可以计算标准化差异为
Given specified values of $p_1$ and $p_2$ we can calculate the standardized difference as

$$\frac{|p_1 - p_2|}{\sqrt{\bar{p}\,(1 - \bar{p})}}$$

其中
where

$$\bar{p} = \frac{p_1 + p_2}{2}.$$

例如,假设我们计划一项试验,比较两种帮助吸烟者戒烟的方法。一组给予新型尼古丁口香糖,另一组接受医生建议和一本小册子。根据已发表的证据,预计建议组中6个月后仍不吸烟的比例为15%。我们希望口香糖组能提高到30%。比较的比例因此为0.30和0.15。假设我们希望以85%的概率检测到这样的差异(如果确实存在),并在5%的显著性水平下认为其具有统计学意义。我们可以使用列线图计算试验所需的样本量。
For example, suppose we are planning a trial to compare two methods of helping smokers to give up smoking. One group is to be given a new kind of nicotine chewing gum and the other group will receive advice from their doctor and a booklet. On the basis of published evidence we expect that in the advice group 15% of smokers will remain non-smokers at 6 months. We would be interested in an improvement to 30% in the group given gum. The proportions to be compared are thus 0.30 and 0.15. Suppose that we want an 85% probability of detecting such a difference, if it really exists, as statistically significant at the 5% level. We can use the nomogram to work out the necessary sample size for the trial.

我们有 $p_1 = 0.15$ 和 $p_2 = 0.30$,因此 $\bar{p} = 0.225$。使用上述公式,标准化差异为
We have $p_1 = 0.15$ and $p_2 = 0.30$, and so $\bar{p} = 0.225$. Using the above formula the standardized difference is given as

$$\frac{0.30 - 0.15}{\sqrt{0.225 \times (1 - 0.225)}} = \frac{0.15}{0.418}$$

或者为0.36。我们将标准化差异0.36与图15.2中的功效0.85连接,并从对应显著性水平0.05的中央轴读取试验所需的样本量,得到 $N = 280$。因此,为满足试验的条件,我们需要每组有140名吸烟者。
or 0.36. We connect the standardized difference of 0.36 to the power of 0.85 in the nomogram in Figure 15.2 and read off the necessary sample size for the trial from the central axis corresponding to a significance level of 0.05, which gives $N = 280$. To meet the conditions specified for the trial we thus need to have 140 smokers in each group.
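
The arithmetic of the smoking example can be checked with a short script. This is a sketch under stated assumptions: the standardized difference is taken as the difference in proportions divided by the square root of p̄(1 − p̄), with p̄ the mean of the two proportions, and the total sample size uses the usual Normal-approximation formula rather than the nomogram. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def standardized_difference(p1: float, p2: float) -> float:
    """Standardized difference for two proportions, using their mean p_bar."""
    p_bar = (p1 + p2) / 2.0
    return abs(p1 - p2) / math.sqrt(p_bar * (1.0 - p_bar))


def total_sample_size(std_diff: float, power: float, alpha: float) -> float:
    """Assumed Normal-approximation formula for the total N with equal groups."""
    z = NormalDist().inv_cdf
    return 4.0 * (z(1.0 - alpha / 2.0) + z(power)) ** 2 / std_diff ** 2


if __name__ == "__main__":
    d = standardized_difference(0.15, 0.30)       # advice group vs gum group
    n_total = total_sample_size(d, power=0.85, alpha=0.05)
    print(f"standardized difference = {d:.2f}")
    print(f"approximate total sample size = {n_total:.0f}")
```

The formula gives a total a little under 280, in line with the 140 smokers per group read from the nomogram; small differences of this kind are to be expected from a graphical method.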

(d) Unequal sample size

该列线图也适用于两组样本量不同的试验。有时出于需要或考虑,采用不等(加权)随机化是可取的。只要不平衡不大,功效损失也较小。
The nomogram can be used for trials in which the sample size in the two groups will be different. Sometimes it is felt desirable or necessary to use unequal (weighted) randomization. As long as the imbalance is not great the loss in power is small.

若要使用列线图规划不等组的研究,必须先假设两组样本量相等计算 $N$,然后计算调整后的样本量 $N'$。若 $k$ 是两组样本量的比例,则所需总样本量为
To use the nomogram to plan a study with unequal groups, we must first calculate $N$ as if we were using equal groups and then calculate the modified sample size $N'$. If $k$ is the ratio of the sample sizes in the two groups, then the required total sample size is

$$N' = \frac{N(1 + k)^2}{4k}$$

两组样本量分别为 $N'/(1 + k)$ 和 $kN'/(1 + k)$。例如,若希望实验组样本数是对照组的两倍,即 $k = 2$,则 $N' = 9N/8 = 1.125N$,增加幅度较小;但若 $k = 3$,则 $N' = 4N/3$,比等样本量时增加了三分之一。
and the two sample sizes are given by $N'/(1 + k)$ and $kN'/(1 + k)$. So, for example, if we wish to put twice as many subjects on the experimental treatment as on the control, we have $k = 2$, and so $N' = 9N/8 = 1.125N$, a fairly small increase, but for $k = 3$ we have $N' = 4N/3$, which is an increase of a third over equal sample sizes.
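
The adjustment for unequal groups is easy to tabulate. The sketch below assumes the relation N' = N(1 + k)²/(4k), which reproduces the increase of a third quoted in the text when one group is three times the size of the other; the function name and the starting value of 280 (the total from the smoking example) are used only for illustration.

```python
def unequal_allocation(n_equal_total: float, k: float) -> tuple[float, float, float]:
    """Return (modified total, smaller group, larger group) for allocation ratio k.

    n_equal_total is the total sample size calculated as if the groups were equal.
    Assumed formula: N' = N * (1 + k)^2 / (4 * k).
    """
    n_modified = n_equal_total * (1.0 + k) ** 2 / (4.0 * k)
    smaller = n_modified / (1.0 + k)
    larger = k * n_modified / (1.0 + k)
    return n_modified, smaller, larger


if __name__ == "__main__":
    for k in (1, 2, 3):
        total, small, large = unequal_allocation(280, k)
        print(f"k = {k}: total {total:.0f} (groups of {small:.0f} and {large:.0f})")
```

With a ratio of 2 the total rises from 280 to 315, and with a ratio of 3 to about 373, so modest imbalance costs little but the penalty grows as the ratio increases.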

(e) Calculating power

列线图也可用于计算给定样本量下的功效。只需用直线连接样本量和标准化差异的相关数值,即可从第三个刻度读取研究的功效。
The nomogram can be used to calculate the power for a given sample size. We just connect by a straight line the relevant values for the sample size and standardized difference and read off the power of the study on the third scale.

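评估不等样本量 $n_1$ 和 $n_2$ 的研究功效时,使用“有效”样本量 $N$,其计算公式为
To evaluate the power of a study with unequal sample sizes $n_1$ and $n_2$ we use the 'effective' sample size $N$, which is calculated as

$$N = \frac{4kN'}{(1 + k)^2}$$

其中 $N' = n_1 + n_2$,且 $k = n_1/n_2$。
where $N' = n_1 + n_2$ and $k = n_1/n_2$.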
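
Working in reverse can also be sketched numerically. The code assumes that the effective sample size is N = 4kN'/(1 + k)², with N' the actual total and k the ratio of the group sizes, and that power follows from the usual Normal approximation, ignoring the negligible chance of a significant result in the wrong direction. The function names and the group sizes are hypothetical.

```python
import math
from statistics import NormalDist


def effective_sample_size(n1: int, n2: int) -> float:
    """Equal-groups total that gives the same power as groups of n1 and n2."""
    k = n1 / n2
    return 4.0 * k * (n1 + n2) / (1.0 + k) ** 2


def power(std_diff: float, n_total: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test for a given total sample size."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    return nd.cdf(std_diff * math.sqrt(n_total) / 2.0 - z_alpha)


if __name__ == "__main__":
    # Hypothetical trial with 210 and 105 subjects and standardized difference 0.36.
    n_eff = effective_sample_size(210, 105)
    print(f"effective sample size = {n_eff:.0f}")
    print(f"approximate power = {power(0.36, n_eff):.2f}")
```

Here 315 subjects split two to one carry the same information as 280 split equally, so the power is about the same as in the equal-groups smoking example.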

(f) Getting enough patients

样本量计算常常显示所需样本量超过单个中心的招募能力。与其进行一项统计效能较低的试验,不如尝试让其他中心参与合作开展“多中心”试验,尽管组织上的困难需要与样本量增加的好处权衡。
Often the sample size calculations reveal a required sample size that exceeds the recruiting capability of a single centre. Rather than carry out a trial that is low in power, it is often worth trying to get other centres to collaborate in a 'multicentre' trial, although there will be organizational difficulties to offset against the benefits of increased sample size.

另一个问题是患者入组速度往往远低于试验组织者的预期。这部分可能源于过于乐观,但更多是因为未能充分考虑试验入组标准的影响。限制入组条件可能导致无法达到计划的样本量,从而影响试验的有效性和结果的普适性。另一个因素是符合条件的患者中拒绝参与的比例。如果这些比例无法可靠估计,规划样本量时应适当预留余地。
A further problem is that the expected rate of accrual of patients to a trial can be much less than anticipated by the trial organizers. While this may be partly through over-optimism, it is often largely because of a failure to appreciate the effect of the trial's eligibility criteria. Restricting eligibility may lead to failure to achieve the planned sample size, and thus affect the usefulness of the trial as well as the generalizability of the results. Another factor here is the proportion of eligible patients who refuse to participate. If these rates cannot be reliably estimated, then it is prudent to make an allowance for them when planning the sample size for the trial.

许多困难可以通过先进行一项试点研究来避免,试点研究还可用于评估数据收集表格的质量,以及检查试验的后勤安排,例如每位患者的预期检查时间,这会影响每次门诊可接诊的患者数量。试点研究还可能提供更可靠的样本量估计值。
Many of the difficulties can be avoided by having a pilot study, which is also valuable for assessing the quality of the data collection forms, and for checking the logistics of the trial, such as the expected time to examine each patient which affects the number that can be seen in a session. A pilot study may also provide more reliable estimates for use in sample size calculations.

15.4 分析 15.4 ANALYSIS

众所周知,偏倚可能在治疗分配或试验执行过程中产生,但在临床试验数据分析阶段也存在若干不太为人所知的偏倚来源。
The possibility of bias entering a trial at the treatment allocation or during the execution of a trial is well known, but there are also several less well known ways in which bias can arise during the analysis of clinical trial data.

原则上,临床试验数据分析应较为直接,采用前几章介绍的相对简单的方法,如 $t$ 检验和 $\chi^2$ 检验。然而,临床试验分析中存在一些特殊问题。首先,我将讨论如何评估治疗组的可比性,随后探讨可能的偏倚来源,最后介绍比单纯比较两组均值或比例更复杂的分析方法。May 等人(1981)对分析中的偏倚有更全面的讨论。
In principle the analysis of clinical trial data should be straightforward, using relatively simple methods outlined in earlier chapters, such as $t$ tests and $\chi^2$ tests. There are, however, several particular problems that arise in the analysis of clinical trials. I shall first consider the assessment of whether the treatment groups are comparable, then some possible causes of bias, and lastly analyses that are more complicated than simply comparing means or proportions in two groups. A fuller discussion of bias in analysis is given by May et al. (1981).

15.4.1 入组特征比较 15.4.1 Comparison of entry characteristics

随机化是消除治疗分配偏倚的方法,但并不能保证各组特征完全相似。第15.2节讨论了保持组间相似性的方法,但大多数试验采用简单随机化,可能导致组间特征差异较大。例如,在包含36名患者的试验中,即使每组各有18名受试者,任何在半数受试者中存在的特征,有6%的概率在某一治疗组中的出现频率至少是另一组的两倍。预后变量的这种不平衡可能显著影响试验结果及其可信度。
Randomization is a method of eliminating bias in the way that treatments are allocated to patients, but it does not guarantee that the characteristics of the different groups are similar. Methods for trying to keep the groups similar were discussed in section 15.2, but most trials use simple randomization, with which it is possible to produce groups with quite different characteristics. For example, in a trial including 36 patients, even when we have 18 subjects in each group, any characteristic that is present in half of the subjects has a 6% chance of being at least twice as common in one treatment group as in the other. Such imbalance for a prognostic variable could have a marked effect on the results of the trial, and on their credibility.

临床试验数据的首要分析应是总结两组患者的入组或基线特征。重要的是要显示各组在可能影响患者反应的变量上相似。例如,我们通常希望确认不同组的年龄分布相似,因为许多结局与年龄相关。吸烟状况和疾病分期也是常被关注的变量。
The first analysis that should be carried out with data from a clinical trial is to summarize the entry or baseline characteristics of the patients in the two groups. It is important to show that the groups are similar with respect to variables that may affect the patient's response. For example, we would usually wish to be satisfied that the age distribution was similar in the different groups, as many outcomes are age-related. Smoking and stage of disease are other variables often looked at in this way.

比较组间基线特征的常用方法是进行假设检验,但稍加思考便可发现这并无助益(Altman,1985)。如果随机分组是公平进行的,我们知道两组治疗间的任何差异必然是偶然产生的。因此,假设检验没有意义。实际上,关注的问题是两组是否存在可能影响治疗反应的差异,这显然是临床重要性的问题,而非统计显著性。假设检验唯一的用途是判断随机分组是否公平,但这只能检测出重大失误。
The usual way of comparing the baseline characteristics of the groups is by performing hypothesis tests, but a moment's thought should suffice to see that this is unhelpful (Altman, 1985). If the randomization is performed fairly we know that any differences between the two treatment groups must be due to chance. A hypothesis test thus makes no sense. In any case the question at issue is whether the groups differ in a way that might affect their response to treatment, which is clearly a question of clinical importance rather than statistical significance. The only use of hypothesis testing is to judge whether the randomization was performed fairly, but this will only detect major failures.

我们期望有5%的检验在5%的显著性水平下呈现显著。
We expect 5% of tests to be significant at the 5% level.

尽管很少有试验结果能像Ueshima等人(1987年)的那样接近预期,即20次比较中只有1次在5%水平上显著,但我们不期望出现与偶然差异极大的偏离。Collins等人(1987年)举了一个极端不平衡的例子,表明随机分组不当。表15.4显示了两个参与早期乳腺癌随机试验中心中,患者分配到活性治疗组或对照组的淋巴结状态。中心2的巨大不平衡只能解释为该中心的随机分组不当,因此应忽略该中心的结果。
While few trials will give results as close to expectation as that of Ueshima et al. (1987), in which 1 of 20 comparisons was statistically significant at the 5% level, we do not expect large discrepancies from chance. Collins et al. (1987) gave an example of extreme imbalance that is incompatible with proper randomization. Table 15.4 shows the nodal status of patients allocated to active treatment or control in two centres participating in a randomized trial in early breast cancer. The enormous imbalance in centre 2 can only be interpreted as indicating that the randomization at the centre was improper, and the results from that centre should be ignored.

表15.4 两个参与早期乳腺癌随机试验中心中,患者不同淋巴结状态分配到治疗组或对照组的人数(来源:Collins等,1987年)
Table 15.4 Number of patients with different nodal status allocated to treatment or control in two centres participating in a randomized trial in early breast cancer (from Collins et al., 1987)

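| 淋巴结状态 | 中心1 治疗组 | 中心1 对照组 | 中心2 治疗组 | 中心2 对照组 |
|---|---|---|---|---|
| 0 | 62 (61%) | 65 (64%) | 27 (22%) | 63 (50%) |
| 1-3 | 29 (28%) | 28 (28%) | 39 (31%) | 44 (35%) |
| 4+ | 11 (11%) | 7 (7%) | 53 (42%) | 18 (14%) |
| 未知 | 0 (0%) | 1 (1%) | 6 (5%) | 1 (1%) |
| 总计 | 102 (100%) | 101 (100%) | 125 (100%) | 126 (100%) |

中心1:卡方 = 2.0,自由度2,P = 0.37;中心2:卡方 = 35.4,自由度2,P < 0.00000001(不包括未知状态)

| Nodal status | Centre 1 Treatment | Centre 1 Control | Centre 2 Treatment | Centre 2 Control |
|---|---|---|---|---|
| 0 | 62 (61%) | 65 (64%) | 27 (22%) | 63 (50%) |
| 1-3 | 29 (28%) | 28 (28%) | 39 (31%) | 44 (35%) |
| 4+ | 11 (11%) | 7 (7%) | 53 (42%) | 18 (14%) |
| Not known | 0 (0%) | 1 (1%) | 6 (5%) | 1 (1%) |
| Total | 102 (100%) | 101 (100%) | 125 (100%) | 126 (100%) |

Centre 1: X² = 2.0 on 2 df, P = 0.37; Centre 2: X² = 35.4 on 2 df, P < 0.00000001 (excluding not knowns)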

基线变量的不平衡只有在该变量与结局变量相关时,才可能对试验的总体结果产生影响。通过适当的随机分组,大多数变量在不同治疗组中会有相似分布。如果存在一个或多个已知或疑似具有预后意义的变量未被很好平衡,我们可以检验这些变量是否确实与结局变量相关,或者如15.4.6节所述,直接在分析中进行调整。
Imbalance in a baseline variable is only potentially important, in the sense of affecting the overall result of the trial, if that variable is related to the outcome variable. With proper randomization most variables will be distributed similarly in the different treatment groups. If there are one or more variables with known or suspected prognostic importance that are not very closely balanced we can see whether those variables really are related to the outcome variable, or we can simply adjust for them in the analysis, as discussed in section 15.4.6.

15.4.2 主要分析 15.4.2 Main analysis

临床试验的主要分析是比较预先指定的结局指标在不同治疗组之间的差异。如前所述,我们可以使用第9章和第10章介绍的简单分析方法。对于独立组试验,可以使用两样本t检验、Mann-Whitney U检验或卡方检验,并构建相应的置信区间。对于配对或匹配研究,可以使用配对t检验、Wilcoxon配对检验或McNemar检验。交叉试验需要特定的分析方法,后文将详细介绍。然而,可能存在一些需要考虑的复杂因素,接下来的几节将进行讨论。
The main analysis of a clinical trial is the comparison of the pre-specified outcome measure(s) between the different treatment groups. As already noted, we can use the simple methods of analysis described in Chapters 9 and 10. For trials of independent groups we can use the two sample $t$ test, Mann-Whitney test, or $\chi^2$ test as appropriate and construct the associated confidence intervals. For paired or matched studies we can use the paired $t$ test, Wilcoxon paired test, or the McNemar test. Crossover trials require a particular form of analysis, which is described below. There are, however, various possible complicating factors that may need to be considered, which are discussed in the next few sections.

15.4.3 不完整数据 15.4.3 Incomplete data

数据可能因多种原因不完整。例如,偶尔的实验室测量会缺失,因为采集的样本不足。重要的是要利用所有可用数据,并明确指出是否有观察值缺失。此外,有些信息可能根本未被记录。虽然看似合理地假设未记录的症状即为不存在,但这种推断通常不安全,只有在仔细考虑具体情况后才应作出。
Data may be incomplete for several reasons. For example, occasional laboratory measurements will be missing because the samples taken were inadequate. It is important to use all the data available, and to specify if any observations are missing. Also, some information may simply not have been recorded. While it may seem reasonable to assume that a particular symptom was not present if it was not recorded, such inferences are in general unsafe and should be made only after careful consideration of the circumstances.

缺失信息最重要的问题与患者在研究结束前退出有关。退出可能由临床医生决定,或许是因为副作用。另一种情况是患者迁移到其他地区,或无故未能返回。应尽力在试验结束时至少获得这些患者的一些状态信息,但仍可能存在数据缺失。一种可能的方法是对所有这些患者赋予最乐观的结局进行分析,然后再用最悲观的结局重复分析。如果两次分析结果相似,且与将这些患者简单排除后的分析结果也相近,则我们可以较为自信地接受研究结论。最常见的方法是直接排除所有此类患者,这在退出人数不多且各治疗组退出比例相似时是合理的。然而,如果某一治疗组的退出人数明显较多,试验结果将受到影响,因为退出很可能与治疗有关。
The most important problem with missing information relates to patients who drop out of the study before the end. Withdrawal may be by the clinician, perhaps because of side-effects. Alternatively, the patient may move to another area or just fail to return without reason. Efforts should be made to obtain at least some information regarding the status of these patients at the end of the trial, but some data are still likely to be missing. One possible approach is to assign the most optimistic outcome to all these patients and analyse the data, and then repeat the analysis with the most pessimistic outcome. If the two analyses yield similar results, and results also similar to those from an analysis in which these patients are simply excluded, then we can be fairly confident in the findings. The most common approach is simply to omit all such patients, which is reasonable if the number of withdrawals is not too great, and if the proportion withdrawing is similar in each treatment group. However, if there are many more withdrawals in one treatment group the results of the trial will be compromised, as it is likely that the withdrawals are treatment-related.
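
The optimistic and pessimistic analyses described above can be sketched for a binary outcome. All the counts below are invented for illustration, and the function name `rate_difference` is hypothetical; the point is only to show the three analyses side by side.

```python
from typing import Optional

# Hypothetical data: (successes, number completing) and number of dropouts per arm.
COMPLETERS = {"new": (45, 80), "control": (36, 85)}
DROPOUTS = {"new": 20, "control": 15}


def rate_difference(dropout_outcome: Optional[bool]) -> float:
    """Difference in success rates (new minus control).

    dropout_outcome: None  -> dropouts excluded from the analysis
                     True  -> every dropout counted as a success (most optimistic)
                     False -> every dropout counted as a failure (most pessimistic)
    """
    rates = {}
    for arm, (successes, completed) in COMPLETERS.items():
        if dropout_outcome is None:
            rates[arm] = successes / completed
        else:
            extra = DROPOUTS[arm] if dropout_outcome else 0
            rates[arm] = (successes + extra) / (completed + DROPOUTS[arm])
    return rates["new"] - rates["control"]


if __name__ == "__main__":
    for label, choice in (("excluded", None), ("optimistic", True), ("pessimistic", False)):
        print(f"dropouts {label}: difference = {rate_difference(choice):.3f}")
```

If the three differences point the same way and are of similar size, the withdrawals are unlikely to have distorted the comparison; a large spread is a warning that the conclusion depends on what happened to the patients who were lost.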

如果主要结局指标是某事件的发生时间,如死亡或疾病复发,那么即使患者退出,我们仍可利用部分数据(参见第13章)。
If the main outcome measure is the time to some event, such as death or recurrence of disease, then we can use some data for all patients, even those who withdraw (see Chapter 13).

15.4.4 方案违背 15.4.4 Protocol violations

在许多试验中,一些患者可能未遵循方案,可能是故意的,也可能是无意的。这里包括实际接受错误治疗(即非分配的治疗)的患者,以及不服用治疗的患者,称为不依从者。另外,有时在试验开始后才发现某位患者实际上并不符合试验资格。
In many trials some patients will not have followed the protocol, either deliberately or accidentally. Included here are patients who actually receive the wrong treatment (i.e. not the one allocated) and patients who do not take their treatment, known as non-compliers. Also it is sometimes discovered after the trial has begun that a patient was not after all eligible for the trial.

处理所有这些情况的唯一安全方法是保留所有随机分配的患者参与试验。因此,分析基于随机分组,称为意向治疗分析。对方案违背的任何其他处理策略都涉及主观判断,可能引入偏倚。有时对仅遵守方案的患者进行额外分析是有用的,但这不能被视为完全公平的比较。例如,排除不依从方案的患者可能导致分析偏倚。以随机分组为基础的分析必须被视为主要分析。
The only safe way to deal with all of these situations is to keep all randomized patients in the trial. The analysis is thus based on the groups as randomized, and is known as an intention to treat analysis. Any other policy towards protocol violations will involve subjective decisions and will thus create an opportunity for bias. It is sometimes useful to perform an additional analysis of only those patients adhering to the protocol, but this cannot be taken as a completely fair comparison. For example, the exclusion of patients who did not comply with the protocol may bias the analysis. The analysis of the groups as randomized must be considered the main analysis.

15.4.5 排除某些事件 15.4.5 Excluding some events

有时感兴趣的事件,如心肌梗死或死亡,发生在随机分组后但治疗开始前,或治疗尚未产生效果之前。将此类患者排除在分析之外是极不明智的,且可能引发争议。设计试验时应尽量缩短随机分组与治疗开始之间的延迟。Sackett 和 Gent(1979)对此问题进行了较详细的讨论。
Sometimes the event of interest, such as myocardial infarction or death, occurs after randomization but before the treatment has commenced, or before it could have had an effect. The exclusion of such patients from the analysis is most unwise and may well lead to controversy. It is desirable to design a trial so that there is a minimal delay between randomization and the start of treatment. Sackett and Gent (1979) discuss this problem at some length.

当关注的结局是特定原因导致的死亡(如癌症)时,会出现类似的问题。通常很难确定死亡是否确实与所治疗的疾病无关,因此通常不宜排除其他原因导致的死亡。
A similar problem arises when the outcome of interest is death from a specific cause such as cancer. It is often unclear if a death is truly unrelated to the medical condition being treated and so it is generally unwise to exclude deaths from other causes.

15.4.6 调整其他变量 15.4.6 Adjusting for other variables

如果我们怀疑试验开始时各组间观察到的差异(不平衡)可能影响了结局,可以在分析中考虑这种不平衡。表15.1显示了一个试验中患者的基线特征,组间差异明显。作者未对这些较大差异进行调整,因为它们均无统计学意义。我们无法确定这种不平衡可能产生的影响。小样本试验中,出现统计学不显著但临床上可能重要的大不平衡是常见的。(因此,对于小样本试验,简单随机化并不是理想的治疗分配方法。)
If we suspect that the observed differences (imbalance) between the groups at the start of the trial may have affected the outcome we can take account of the imbalance in the analysis. Table 15.1 showed some of the baseline characteristics of patients in a trial where the groups look markedly different. The authors did not adjust for the large differences because none of them is statistically significant. We do not know what effect the imbalance may have had. With small trials it is quite common to have large imbalances that are not statistically significant but which could well be clinically important. (For small trials, therefore, simple randomization is not a good method of treatment allocation.)

大多数临床试验基于比较两组在单一主要变量上的差异,这种统计分析较为简单。然而,我们可能希望在分析中考虑一个或多个其他变量。原因之一可能是两组在基线变量上不相似,如表15.1所示。因此,我们可以进行调整和未调整的分析。如果结果相似,则可以推断不平衡影响不大,可引用简单比较结果;如果结果不同,则应采用调整后的分析。只有当变量与结局相关时,不平衡才会影响结果。如果一个组的平均身高远低于另一组,但身高与治疗反应无关,则无关紧要。表15.1显示多个变量存在不平衡,这些变量合理推测与结局相关,因此强烈建议进行调整分析。采用某种限制性随机化方法以获得相似的组别是可取的,这样可以简化后续数据分析。
Most clinical trials are based on the simple idea of comparing two groups with respect to a single variable of prime interest, for which the statistical analysis is straightforward. We may, however, wish to take one or more other variables into consideration in the analysis. One reason might be that the two groups were not similar with respect to baseline variables, as in Table 15.1. We can thus perform the analysis with and without adjustment. If the results are similar we can infer that the imbalance was not important, and can quote the simple comparison, but if the results are different we should use the adjusted analysis. Imbalance will only affect the results if the variable is related to the outcome measure. It will not matter if one group is on average much shorter than the other if height is unrelated to response to treatment. Table 15.1 shows imbalance for several variables which we might reasonably suppose would be related to outcome, so an adjusted analysis is strongly indicated. The use of some form of restricted randomization that is designed to give similar groups is thus desirable as it simplifies the subsequent analysis of the data.

即使两组特征非常相似,如果事先知道某变量与预后密切相关,仍然建议进行调整。年龄常是此类变量。对已知影响结局的变量进行调整可以提高试验的统计效能,尽管提升不大,主要是通过提高估计治疗效应的精确度。同样,调整的效果可以通过与未调整分析的比较来评估。更多讨论见Altman(1985)。
Even if the groups had very similar characteristics it may still be desirable to adjust for another variable if we know in advance that the variable is strongly related to prognosis. Age is often such a variable. Adjustment for variables known to affect outcome can improve the power of the trial, although not greatly, by improving the precision with which we estimate the treatment effect. Again the effect of adjustment can be assessed by comparison with the unadjusted analysis. Further discussion is given in Altman (1985).

调整其他变量需要使用协方差分析或某种多元回归分析,详见第12章。
Adjusting for other variables requires the use of the analysis of covariance or some form of multiple regression analysis, as described in Chapter 12.

15.4.7 多重结局指标 15.4.7 Multiple outcome measures

我在15.2.10节中建议,尽可能将一个结局指标作为分析的主要关注点。可能还有其他结局指标,可以用相同方法分析,但结果给予较少强调。如果确实存在多个重要结局指标,则应将统计学显著性的 $P$ 值设定得比通常的5%更严格,以降低第一类错误风险。一种简单方法是使用Bonferroni校正,即若分析 $k$ 个变量,则将 $P$ 值乘以 $k$(见9.8.4节)。
I suggested in section 15.2.10 that where possible one outcome measure should be treated as the main focus of attention in the analysis. There may be other outcome measures, and these can be analysed using the same methods, but the findings given less emphasis. If there are genuinely several outcome measures of importance, then the $P$ value considered statistically significant should be made smaller than the usual 5% to keep the risk of a Type I error small. One simple method is to use the Bonferroni correction, in which if there are $k$ variables being analysed then the $P$ values are multiplied by $k$ (see section 9.8.4).
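The Bonferroni correction is simple enough to write out directly. The sketch below assumes a list of unadjusted P values, one per outcome measure; the function name `bonferroni` and the example values are hypothetical, and capping the adjusted value at 1 is a common convention that is not stated in the text.

```python
# Illustrative sketch only: the P values below are invented.

def bonferroni(p_values):
    """Multiply each P value by the number of comparisons, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


if __name__ == "__main__":
    unadjusted = [0.01, 0.04, 0.50]  # three hypothetical outcome measures
    for raw, adjusted in zip(unadjusted, bonferroni(unadjusted)):
        print(f"unadjusted P = {raw:.2f}, adjusted P = {adjusted:.2f}")
```

An outcome with an unadjusted P of 0.04 is no longer significant at the 5% level once three outcomes are taken into account.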

Smith等(1987)回顾了四个主要综合性期刊(Lancet、British Medical Journal、New England Journal of Medicine和Journal of the American Medical Association)发表的66个临床试验,发现平均分析的结局指标数为22。对多重比较风险的认识较少。对196份类风湿性关节炎非甾体抗炎药试验报告的回顾(Gøtzsche,1989)发现,使用了70多种不同的结局指标,每个试验中位数为8。仅有6%的试验事先选择了主要结局变量。
Smith et al. (1987) reviewed 66 clinical trials published in four major general journals: Lancet, British Medical Journal, New England Journal of Medicine, and the Journal of the American Medical Association. They found that the mean number of outcome measures analysed was 22. Appreciation of the dangers of multiple comparisons was rare. A review of 196 reports of trials of nonsteroidal anti-inflammatory drugs in rheumatoid arthritis (Gøtzsche, 1989) found that over 70 different outcome measures were used, with a median of eight per trial. In only 6% of trials was a main outcome variable chosen in advance.

Gøtzsche(1989)也指出了多次计数测量值或副作用的常见错误。临床试验的“抽样单位”(调查单位)是患者,因此结果应以患者为单位,而非例如关节或牙齿。
Gøtzsche (1989) also highlighted the common error of multiple counting of measurements or side-effects. The 'sampling unit' (unit of investigation) of a clinical trial is the patient, so results should relate to patients rather than, for example, joints or teeth.

15.4.8 基线变化 15.4.8 Changes from baseline

我在第5.2节中观察到,临床试验是一种纵向研究。尽管通常以研究期末患者的状态作为关注的结局,有时更合适的是以治疗前或基线测量的变化作为主要结局指标。例如,在比较抗哮喘治疗的试验中,关注点应是每位个体肺功能的改善,而非研究结束时的肺功能。这种分析的重要优势在于消除了各组之间在治疗前结局变量水平上的差异。分析基线变化时,在每个治疗组内单独进行分析(无论是假设检验还是置信区间)是误导性的。更好的方法是计算每位患者的基线变化值,然后直接比较不同组的变化。
I observed in section 5.2 that a clinical trial is a longitudinal study. Although it is common to take the patients' status at the end of the study period as the outcome of interest, sometimes it is more appropriate to take the change from the pre-treatment, or baseline, measurement as the prime outcome measure. For example, in a trial comparing anti-asthma treatments, the improvements in each individual's lung function would be the focus of attention rather than their lung function at the end of the study. This analysis has the important advantage of removing any differences between the groups with respect to pre-treatment levels of the outcome variable. When changes from baseline are analysed it is misleading to perform separate analyses (either hypothesis tests or confidence intervals) within each treatment group. A better approach is to calculate each patient's change from baseline, and then compare directly the changes in the different groups.
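As a rough illustration of the recommended approach, the sketch below first computes each patient's change from baseline and then compares the changes between the two groups directly. The lung function values, the function names and the use of a pooled standard error are all assumptions made for the example.

```python
# Illustrative sketch only: invented lung function readings (litres).
from math import sqrt
from statistics import mean, variance


def changes_from_baseline(baseline, final):
    """Per-patient change: final reading minus baseline reading."""
    return [after - before for before, after in zip(baseline, final)]


def compare_changes(changes_a, changes_b):
    """Difference in mean change (A minus B) and its pooled standard error."""
    n_a, n_b = len(changes_a), len(changes_b)
    pooled_var = ((n_a - 1) * variance(changes_a)
                  + (n_b - 1) * variance(changes_b)) / (n_a + n_b - 2)
    diff = mean(changes_a) - mean(changes_b)
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, se


if __name__ == "__main__":
    # Hypothetical readings for five patients per group.
    active = changes_from_baseline([2.1, 1.8, 2.5, 2.0, 1.6],
                                   [2.6, 2.1, 2.9, 2.6, 1.9])
    control = changes_from_baseline([2.0, 1.9, 2.4, 2.2, 1.7],
                                    [2.1, 2.0, 2.4, 2.4, 1.8])
    diff, se = compare_changes(active, control)
    print(f"difference in mean change = {diff:.2f} litres (SE {se:.2f})")
```

The single between-group comparison replaces the two misleading within-group analyses: the question is whether the changes differ between treatments, not whether each group changed.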

15.4.9 亚组分析 15.4.9 Subgroup analyses

人们常常关注哪些患者对治疗反应良好,哪些反应较差。我们可以通过对数据的子集分别分析来回答这类问题。例如,我们可能只分析男性患者、50岁以下患者或具有特定症状的患者。此类亚组分析在解释上存在与多重结局指标类似的问题。如果这些亚组分析在方案中已明确规定,进行少量亚组分析是合理的,但绝不可为了寻找显著差异而对数据进行多种不同方式的反复分析。Collins 等(1987)给出了多重亚组搜索风险的例子,他们在一项针对疑似急性心肌梗死患者的试验中发现,天蝎座出生的患者治疗获益是其他所有星座患者的四倍。
There is often interest in identifying which patients do well on a treatment and which do badly. We can answer a question like this by analysing the data separately for subsets of the data. We may, for example, re-do the analysis including only male patients, only patients aged under 50, or those with a particular symptom. Subgroup analyses like these pose problems of interpretation similar to those resulting from multiple outcome measures. It is reasonable to carry out a small number of subgroup analyses if these were specified in the protocol, but on no account should the data be analysed in numerous different ways in the hope of discovering some significant comparison. An example of the dangers of searching through multiple subgroups is given by Collins et al. (1987), who showed that in a trial on patients with suspected acute myocardial infarction the benefit of treatment was four times as great for patients born under Scorpio as for patients born under all other signs put together.

在许多情况下,真正感兴趣的问题不是治疗间差异是否存在于患者的某个亚组中,而是治疗效果是否在两个或多个互补亚组之间存在差异。因此,例如,在安慰剂对照试验中,我们可能想知道活性治疗在年轻患者中是否比老年患者更有效。一种常见的方法是对年轻患者和年长患者分别进行分析并比较两个P值。这种分析基于在各组内分别进行的分析来比较两组,方法不正确。(类似情况已在上一节中描述。)正确的方法是比较两年龄组之间治疗效果的差异;换句话说,我们关注年龄与治疗之间的交互作用。无论结果变量是连续型、二分类还是生存时间,都可以在合适的多元回归模型中检验交互作用的可能性。我建议对此分析寻求专家意见。(另见Pocock,1983,第213页。)注意,这种分析更类似于观察性研究,因此我们不能从任何关联中推断因果关系。
In many cases the real question of interest is not whether the difference between the treatments is present in a subgroup of patients, but whether the treatment effect differs among two or more complementary subgroups. Thus, for example, in a placebo-controlled trial we may wish to know if the active treatment is more effective among younger patients than older patients. A common approach is to analyse separately the data for the younger and older patients and compare the two P values. This analysis makes comparisons between the two groups based on analyses carried out separately within each group, and is not a valid method. (A similar situation was described in the previous section.) The correct approach is to compare the difference between the treatments for the two age groups; in other words we look at the interaction between age and treatment. The possibility of an interaction can be examined within an appropriate multiple regression model, whether the outcome variable is continuous, binary or survival time. I recommend expert advice for this analysis. (See also Pocock, 1983, p. 213.) Note that this analysis is more like that from an observational study, and so we cannot infer causality from any association.
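The idea of comparing the treatment effects directly, rather than comparing two P values, can be illustrated with a simple large-sample calculation. This is a stand-in for the regression approach recommended above, not a replacement for it: it assumes that a treatment effect estimate and its standard error are available for each of two complementary, independent subgroups, and that a Normal approximation is reasonable. The function name and the numbers are hypothetical.

```python
# Illustrative sketch only: a z test on two subgroup estimates is assumed here
# in place of fitting an interaction term in a regression model.
from math import erfc, sqrt


def interaction_test(effect_1, se_1, effect_2, se_2):
    """Compare treatment effects estimated in two independent subgroups."""
    diff = effect_1 - effect_2
    se_diff = sqrt(se_1 ** 2 + se_2 ** 2)
    z = diff / se_diff
    p = erfc(abs(z) / sqrt(2))  # two-sided P from the standard Normal
    return diff, se_diff, z, p


if __name__ == "__main__":
    # Hypothetical treatment effects (active minus placebo) by age group.
    diff, se_diff, z, p = interaction_test(8.0, 3.0, 4.0, 4.0)
    print(f"difference in effects = {diff:.1f} (SE {se_diff:.1f}), "
          f"z = {z:.2f}, P = {p:.2f}")
```

Two subgroup analyses can give one 'significant' and one 'non-significant' result even when a direct comparison like this shows little evidence that the effects differ.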

15.4.10 交叉试验 15.4.10 Crossover trials

交叉试验在第15.2.5节中已有描述。这里将使用一项比较钙通道阻滞剂尼卡地平与安慰剂治疗雷诺现象的试验数据(Kahan 等,1987)来说明交叉试验的分析方法。数据表示两周内的发作次数,分别列于表15.5中,区分先服用尼卡地平后服用安慰剂的组和先服用安慰剂后服用尼卡地平的组。
Crossover trials were described in section 15.2.5. The analysis of a crossover trial will be illustrated using data from a trial comparing nicardipine, a calcium-channel blocker, and placebo in the treatment of Raynaud's phenomenon (Kahan et al., 1987). The data, representing the number of attacks in two weeks, are shown in Table 15.5 separately for the groups having nicardipine followed by placebo and vice versa.

通过计算每个受试者在两个时期观察值的差值 $d$ 和平均值 $a$,并对每组数据取平均,如表15.5所示,分析得以简化。忽视研究设计,仅进行简单的治疗比较是不正确的。在比较治疗之前,应先进行另外两个检验。正确的分析包括三次两样本 $t$ 检验或 Mann-Whitney 检验;这里使用的是 $t$ 检验。(对于分类数据,我们使用 $\chi^2$ 检验。)
The analysis is simplified by calculating for each subject the difference $d$ and average $a$ of the observations in the two periods, and averaging these for each group as shown in Table 15.5. It is incorrect to ignore the design of the study and just perform a simple comparison of treatments. Before comparing the treatments there are two other tests that should be carried out. The correct analysis consists of three two sample $t$ tests or Mann-Whitney tests; $t$ tests are used here. (For categorical data we use $\chi^2$ tests.)

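通过两样本 $t$ 检验来测试周期效应的可能性,比较两组患者在两个周期之间的差异。如果患者在某一周期表现更好没有普遍趋势,我们期望两组之间周期差异的均值大小相同但符号相反。因此,周期效应的检验是比较 $\bar{d}_1$ 与 $-\bar{d}_2$ 的两样本 $t$ 检验。
The possibility of a period effect is tested by a two sample $t$ test comparing the differences between the periods in the two groups of patients. If there is no general tendency for patients to do better in one of the two periods, we would expect the mean differences between the periods in the two groups to be of the same size but of opposite sign. The test for a period effect is thus a two sample $t$ test comparing $\bar{d}_1$ and $-\bar{d}_2$.

[The rest of this passage did not survive extraction. The recoverable fragments refer to two sample $t$ tests on $\bar{a}_1$ and $\bar{a}_2$ and on $\bar{d}_1$ and $\bar{d}_2$ ($n = 10$ in each group), to the results $t = 1.82$, $t = 0.613$, $P = 0.09$, $P = 0.55$ and $t = 2.154$ ($P = 0.045$), to the values 95%, 5% and $0.05 < P < 0.10$, and to a plot described with reference to $y = 0$, as in Figure 15.3(a).]

Figure 15.3(b) shows such a plot for the nicardipine trial, indicating both horizontal and vertical differences between the two groups, in line with the results already presented.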
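The three two sample $t$ tests of a two-period crossover trial can be sketched as follows. The period test follows the description above. The other two comparisons, $\bar{a}_1$ with $\bar{a}_2$ for a treatment-period interaction and $\bar{d}_1$ with $\bar{d}_2$ for the treatment effect, follow the usual form of this analysis and are an assumption here, although they agree with the symbols that survive in the text. The data are invented (they are not those of Table 15.5), and $d$ is taken as period 1 minus period 2. Only the $t$ statistic and its degrees of freedom are computed; the P value would be read from tables of the $t$ distribution.

```python
# Illustrative sketch only: invented data, pooled-variance t statistics.
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance two sample t statistic and its degrees of freedom."""
    n_x, n_y = len(x), len(y)
    df = n_x + n_y - 2
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n_x + 1 / n_y))
    return t, df


def crossover_tests(group_1, group_2):
    """Each group is a list of (period 1, period 2) observations per subject."""
    d1 = [p1 - p2 for p1, p2 in group_1]        # period differences, group 1
    d2 = [p1 - p2 for p1, p2 in group_2]        # period differences, group 2
    a1 = [(p1 + p2) / 2 for p1, p2 in group_1]  # subject averages, group 1
    a2 = [(p1 + p2) / 2 for p1, p2 in group_2]  # subject averages, group 2
    return {
        "period": two_sample_t(d1, [-d for d in d2]),
        "interaction": two_sample_t(a1, a2),
        "treatment": two_sample_t(d1, d2),
    }


if __name__ == "__main__":
    # Hypothetical numbers of attacks in each period.
    active_first = [(4, 7), (2, 6), (5, 5), (3, 8), (6, 9)]
    placebo_first = [(8, 5), (6, 4), (7, 7), (9, 5), (5, 3)]
    for name, (t, df) in crossover_tests(active_first, placebo_first).items():
        print(f"{name:12s} t = {t:6.2f} on {df} degrees of freedom")
```

Keeping the two groups separate until the final comparison is what distinguishes this analysis from a naive paired comparison of the two treatments.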

比较每周期开始时的基线测量值可判断洗脱期是否成功。例如,表15.6显示了一项随机交叉试验中比较利福平与苯巴比妥治疗胆汁性肝硬化瘙痒的基线数据。显然,第1组患者在第2周期开始时的瘙痒程度较研究开始时轻微。因此,要么瘙痒已被第一次治疗改善,因此交叉试验不适用,要么洗脱期过短。一般而言,将基线测量纳入分析是有利的,但这会使分析更加复杂。
A comparison of baseline readings taken at the start of each period can show whether the washout period was successful. For example, Table 15.6 shows baseline data from a randomized crossover trial comparing rifampicin with phenobarbitone for treatment of pruritus in biliary cirrhosis. It is clear that patients in the first group had less severe pruritus at the beginning of the second period than at the start of the study. Thus either the pruritus had been improved by the first treatment, so that a crossover trial was inappropriate, or the washout period was too short. In general, it is advantageous to incorporate baseline readings into the analysis, but this makes the analysis more complex.

表15.6 两周期交叉试验中每周期开始前瘙痒评分分布,评分范围0(轻微)至3(严重)(Bachs等,1989)
Table 15.6 Distribution of pruritus scores, from 0 (mild) to 3 (severe) before each period in a two-period crossover trial (Bachs et al., 1989)

| 瘙痒评分 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 第1组 (n = 12) | | | | |
| 利福平前 | 0 | 2 | 5 | 5 |
| 苯巴比妥前* | 3 | 3 | 1 | 4 |
| 第2组 (n = 10) | | | | |
| 苯巴比妥前 | 0 | 2 | 2 | 6 |
| 利福平前 | 0 | 2 | 2 | 6 |

| Pruritus score | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Group 1 (n = 12) | | | | |
| Before rifampicin | 0 | 2 | 5 | 5 |
| Before phenobarbitone* | 3 | 3 | 1 | 4 |
| Group 2 (n = 10) | | | | |
| Before phenobarbitone | 0 | 2 | 2 | 6 |
| Before rifampicin | 0 | 2 | 2 | 6 |

*一名患者在第1周期后退出。
*One patient dropped out after period 1.

交叉试验特别容易受到患者退出的影响。如果患者在第一阶段后退出,他们无法被纳入分析,因为他们从未接受过另一种治疗。因此,当出现退出时,随机分组的完整性受到破坏,尤其是当某一组的退出率较高时。如果退出人数较多,最好舍弃第二阶段的数据。
Crossover trials are particularly vulnerable to the effects of patient withdrawal. If a patient withdraws after the first period they cannot be included in the analysis because they never received the other treatment. The randomized groups are thus compromised when there are withdrawals, especially when these are more common in one group. If there are many withdrawals it may be best to discard the data from the second period.

在交叉试验的报告中,必须记录所有退出试验的患者及其原因。此外,应描述两个随机分组的基线特征。虽然这在平行组试验中是常规做法,但大多数已发表的交叉试验报告并未提供这些信息。
In a report of a crossover trial it is essential that any withdrawals from the trial are documented, with reasons. Also, the baseline characteristics of the two randomized groups should be described. Although this is routine in parallel group trials, most published reports of crossover trials do not give this information.

15.5 结果的解释 15.5 INTERPRETATION OF RESULTS

15.5.1 单个试验 15.5.1 Single trials

在大多数情况下,临床试验的统计分析相对简单,至少针对主要结局指标,可能仅涉及 $t$ 检验或卡方检验。因此,解释似乎也很直接,但存在一个困难。从样本到总体的推断依赖于试验参与者能够代表所有此类患者这一假设。然而,在大多数试验中,参与者是按照一定的纳入标准选择的,因此将结果外推到其他类型的患者可能并不合理。例如,大多数抗高血压药物(如β受体阻滞剂)的试验是在中年男性中进行的。是否合理假设结果同样适用于女性,或年轻或高龄男性?在没有相反信息的情况下,通常会推断结果具有更广泛的适用性,但应考虑不同群体可能有不同反应的可能性。正是基于这种可能性,才进行亚组分析,因为它们可能揭示不同患者群体对治疗(或副作用)效果的差异。不幸的是,如上所述,进行多次此类分析存在得出误导性结果的风险。
In most cases the statistical analysis of a clinical trial is relatively simple, at least with respect to the main outcome measure, and may involve just a $t$ test or a Chi squared test. Interpretation seems straightforward, therefore, but for one difficulty. Inference from a sample to a population relies on the assumption that the trial participants are representative of all such patients. In most trials, however, participants are selected to conform to certain inclusion criteria, so extrapolation of results to other types of patient may not be warranted. For example, most trials of anti-hypertensive agents, such as beta-blocking drugs, are carried out on middle-aged men. Is it reasonable to assume that the results apply to women too, or to young or very old men? In the absence of any information to the contrary it is common to infer wider applicability of results, but the possibility that different groups would respond differently should be borne in mind. It is because of this possibility that subgroup analyses are carried out, as they may give clues about variation in the effectiveness of a treatment (or side-effects) for different groups of patients. Unfortunately, as indicated above, there is a risk of coming up with a misleading result as a consequence of carrying out several such analyses.

15.5.2 所有已发表的试验 15.5.2 All published trials

在许多领域,存在多个类似的临床试验,自然希望一次性评估所有证据。观察同一治疗的一系列临床试验结果时,首先显现的是结果的差异,有时差异显著。我们当然预期治疗效果会有一定的随机变异,因此不必过度担忧。单个试验观察到的治疗效益的置信区间,提供了同样规模试验系列中可能观察到的治疗效益范围的概念。
In many fields there have been several similar clinical trials, and it is natural to want to assess all the evidence at once. The first thing that becomes apparent when looking at the results of a series of clinical trials of the same treatment is that the results vary, sometimes markedly. We would of course expect to see some variation in treatment effect, because of random variation, and should not necessarily be worried by it. The confidence interval for the treatment benefit observed in a single trial gives an idea of the range of treatment benefit likely to be observed in a series of trials of the same size.

近年来,出现了对所有已发表试验数据进行正式统计分析,以获得治疗效果总体评估的趋势。该分析称为综述或荟萃分析(Collins等,1987)。综述常常发现总体治疗效益高度显著,而大多数单个试验未达到显著结果。这并不令人惊讶,因为许多临床试验规模太小,只能检测到极其巨大的治疗效益。对综述的常见批评是,它们结合了患者特征和设计不同的试验信息。然而,任何试验都可视为一系列试验中的一个,代表疾病谱的一部分(Elwood,1982),因此综述得出的清晰结论,表明结果的推广性比单个试验更广泛。但重要的是评估结果是否因试验性质不同而异。
A recent development has been a move towards the formal statistical analysis of data from all published trials to get an overall assessment of treatment effectiveness. The analysis is known either as an overview or a meta-analysis (Collins et al., 1987). Overviews have often found a highly significant overall treatment benefit when most of the individual trials did not get a significant result. Again, this is not surprising, as many clinical trials are too small to detect anything other than an unrealistically huge treatment benefit. A common criticism of overviews is that they combine information from trials with different patient characteristics and designs. However, any trial may be considered as one of a series, representing just part of the spectrum of disease (Elwood, 1982), and so a clear picture emerging from an overview will indicate wider generalizability of results than is warranted from a single trial. It is important, though, to assess whether the results differ according to the nature of the trial.
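A small numerical sketch shows how an overview can give a clear result when no single trial does. The text does not say how the trials are combined; the example assumes one common choice, a fixed-effect analysis that weights each trial by the inverse of its variance, with a Normal approximation for the 95% confidence interval. The four trials and their estimates are hypothetical.

```python
# Illustrative sketch only: inverse-variance (fixed-effect) pooling is assumed,
# and the trial estimates are invented.
from math import sqrt


def pool_fixed_effect(trials):
    """trials: list of (estimate, standard_error) pairs on a common scale."""
    weights = [1 / se ** 2 for _, se in trials]
    pooled = sum(w * est for w, (est, _) in zip(weights, trials)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


def confidence_interval_95(estimate, se):
    """Approximate 95% confidence interval using the Normal distribution."""
    return estimate - 1.96 * se, estimate + 1.96 * se


if __name__ == "__main__":
    # Four small hypothetical trials, each too imprecise to exclude zero.
    trials = [(0.3, 0.2), (0.3, 0.2), (0.3, 0.2), (0.3, 0.2)]
    for est, se in trials:
        low, high = confidence_interval_95(est, se)
        print(f"single trial: {est:.2f} ({low:.2f} to {high:.2f})")
    pooled, pooled_se = pool_fixed_effect(trials)
    low, high = confidence_interval_95(pooled, pooled_se)
    print(f"pooled:       {pooled:.2f} ({low:.2f} to {high:.2f})")
```

Each single-trial interval includes zero, but the pooled interval does not, because the pooled standard error is half that of any one trial.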

进行综述时的一个问题是,它们通常基于所有已发表的试验。越来越多证据表明,医学期刊存在发表偏倚(Begg和Berlin,1988),可能是无意的,即如果治疗效果显著,临床试验结果更容易发表;反之则较难发表。此外,当结果不显著时,作者发表的积极性也较低。这种偏倚源于普遍但错误的观念,认为不显著结果无趣或无信息价值。这里我们看到,使用置信区间表达结果而非孤立的P值,可能带来的另一种优势。
A problem in performing overviews is that they usually are based on all the published trials. There is increasing evidence that medical journals exert publication bias (Begg and Berlin, 1988), perhaps unintentionally, by which it is easier to publish the results of a clinical trial in a journal if the treatment effect was significant than if it was not. Also authors make less effort to publish when the results are not significant. Such bias stems from the widespread but mistaken belief that non-significant results are uninteresting or uninformative or both. Here we see another possible benefit of expressing results as a confidence interval rather than an isolated P value.

汇总所有已发表的试验会放大任何发表偏倚,这也是反对综述的主要理由。然而,系统地利用现有信息总比依赖对各试验的主观评估要好。虽然更理想,但提取未发表试验的信息自然极为困难,不过已有少数案例实现了这一点(例如,Yusuf 等,1985年)。
Pooling all published trials will magnify any publication bias, and this is the major argument against overviews. However, it is better to use the available information in a systematic way than to rely on subjective assessment of the various trials. Although preferable, it is naturally exceedingly difficult to extract information about unpublished trials, but there have been a few cases where it has been done (for example, by Yusuf et al., 1985).

15.6 撰写与评估临床试验 15.6 WRITING UP AND ASSESSING CLINICAL TRIALS

无论是撰写临床试验报告还是评估已发表试验的质量,拥有一份重要事项的检查表都非常有用。下一章的图16.2展示了英国医学杂志统计学审稿人用来评审临床试验的检查表。
For both writing up a clinical trial and assessing the quality of a published trial it is useful to have a check list of the important issues. Figure 16.2 in the next chapter shows the check list used by statisticians refereeing clinical trials for the British Medical Journal.

15.6.1 撰写临床试验论文 15.6.1 Writing a paper about a clinical trial

图16.2中的检查表大致说明了报告试验设计和执行时应包含的信息。更多细节可参考Altman等(1989)、Chalmers等(1981)、Gardner等(1989)、Grant(1989)和Simon与Wittes(1985)。非常重要的是要说明所有最初随机分组的患者情况,指出各组中退出的人数。对于大型试验,更好的是展示所有考虑入组患者的最终情况—这可以通过流程图实现(Hampton,1981)。
The check list in Figure 16.2 gives some idea of the information that should be included in a report about the design and execution of a trial. Further details can be found in Altman et al. (1989), Chalmers et al. (1981), Gardner et al. (1989), Grant (1989) and Simon and Wittes (1985). It is very important to account for all the patients that were originally randomized, indicating the numbers in each group that were withdrawn. It is better still, especially for large trials, to show what happened to all patients considered for entry to the trial - this can be done by a flow-chart (Hampton, 1981).

结果部分应包括各组基线特征的信息,特别是已知预后因素的相关情况。需要对组间的可比性进行一些评论—这不应基于假设检验。
The results section should include information about the baseline characteristics of the different groups, especially with respect to known prognostic factors. Some comment on the comparability of the groups is needed - this should not be based on hypothesis testing.

随后应呈现组间比较的结果,同时注意前文讨论的多重结局指标和亚组分析的问题。
Thereafter the results of between group comparisons should be presented, taking note of the problems of multiple outcome measures and subgroup analyses as discussed above.

15.6.2 评估已发表的试验 15.6.2 Assessing published trials

试验必须根据已发表报告中包含的信息进行判断。如果信息未提供,我们不能假设检查表上的任何问题都得到满意回答。正如Colton(1974,第269页)所指出:“作者有责任证明偏倚未发生或不太可能发生。”
Trials must be judged on the information that is included in the published report. We cannot assume a satisfactory answer to any of the questions on the check list if the information is not given. As Colton (1974, p. 269) noted: 'It is the author's onus to demonstrate that bias did not occur or was unlikely to have arisen'.

许多综述显示,已发表临床试验的质量仍有很大提升空间。如今大多数试验都是随机的,但并非所有试验都尽可能地进行了盲法。很少有试验在设计时考虑了检测临床重要治疗效益所需的样本量。试验报告中常见的问题包括未说明用于分析数据的统计方法,以及未包含或未考虑所有随机分组的患者。我将在下一章中更全面地讨论已发表论文的质量。
Many reviews have shown that the standard of published clinical trials leaves a lot to be desired. Most trials these days are randomized, but not all are as blind as they could be. Few trials seem to have been planned with regard to the sample size necessary to detect a clinically important treatment benefit. Common problems in reports of trials are the omission of the statistical method used to analyse the data, and failure to include or account for all the patients that were randomized. I discuss the quality of published papers in general more fully in the next chapter.

练习 EXERCISES

15.1 你将如何评估表3.5所示试验中治疗组在基线时的可比性?
15.1 How would you assess the comparability of the treatment groups at baseline in the trial illustrated in Table 3.5?

15.2 在一项开放(非盲法)试验中,β受体阻滞剂阿普洛洛尔用于心肌梗死后患者,随机分组在入院时进行(Ahlmark和Saetre,1976)。用药开始于入院两周后,此时原始的393名患者中有60%已退出试验。退出原因主要是死亡、未确诊心肌梗死或β受体阻滞剂禁忌。在实际接受治疗的162名患者中,69名接受阿普洛洛尔,93名接受对照治疗。
15.2 In an open (unblinded) trial of the beta-blocker alprenolol given to patients after myocardial infarction, randomization to alprenolol or the standard treatment was at the time of admission to hospital (Ahlmark and Saetre, 1976). The start of medication was two weeks after admission, by which time 60% of the original 393 patients had been withdrawn from the trial. The withdrawals were mainly because of death, unconfirmed myocardial infarction, or contraindication to beta-blockers. Of the 162 patients actually treated, 69 received alprenolol and 93 the control treatment.

[The remainder of this exercise and the exercises that follow did not survive extraction. The recoverable fragments are: 900 mg; 10 cm; $t$; 80%; $P < 0.05$; 30%; 20%; 80%; $\frac{11}{31}$ (36%) and $\frac{4}{34}$ (12%), $P = 0.024$. The last exercise ends:] ... 95% confidence interval for the RR, and comment on the authors' conclusion that large scale clinical trials are needed. (See also problem 10.7.)

16 医学文献 16 The medical literature

人们确实感到,设计和分析的统计技术有时被采用,更多像是一种仪式,旨在安抚那些最后掌握绝对权力的人(期刊编辑),也可能是监管机构,而不是因为这些技术被认为在科学上重要。
One does feel that statistical techniques both of design and analysis are sometimes adopted rather as rituals designed to assuage the last holders of absolute power (editors of journals) and perhaps also regulatory agencies, and not because the techniques are appreciated to be scientifically important.

Cox (1983)

16.1 引言 16.1 INTRODUCTION

本世纪以来,临床研究迅速发展,其对临床实践的影响也日益增强。研究结果的发表,尤其是在权威期刊上,将迅速将这些发现传播到全球。同行评审期刊上的论文意味着该研究既科学合理又具有临床价值—它赋予工作可信度和尊严。如果所有发表的论文都科学合理,那当然很好,但遗憾的是,从统计学角度看,研究质量仍有很大提升空间。几乎任何医学期刊的任何一期中,都能看到设计不当和分析错误的例子。
During this century clinical research has grown enormously as has its influence on clinical practice. Publication of research results, especially in a leading journal, will rapidly disseminate those findings all over the world. A paper in a peer-reviewed journal implies that the research is both scientifically sound and clinically worthwhile - it bestows both credibility and respectability on the work. This would be fine if all published papers were scientifically sound but, regrettably, the standard of research leaves much to be desired from the statistical point of view. Examples of substandard design and incorrect analysis can be seen in almost any issue of any medical journal.

合理设计和分析的重要性不容忽视。显然,研究结论必须建立在正确方法的基础上。如果结论因方法错误而不可靠,那么该研究就没有临床价值。更糟的是,结论误导可能导致临床上的危害,而临床有害的研究无疑是不道德的。
The importance of sound design and analysis cannot be overemphasized. Clearly the conclusions from a study must rely on the methods having been correct. If the conclusions are unreliable because of faulty methodology, then the study cannot be clinically worthwhile. Worse, it may be clinically harmful by reason of the conclusions being misleading, and a clinically harmful study is surely unethical.

因此,带着一定的谨慎,我在1980年提出统计学的误用是不道德的(Altman, 1982a),这一观点后来被广泛认可,且未受到质疑。低质量研究(不仅仅是统计错误)的伦理影响包括:
Thus, with some diffidence, in 1980 I suggested that the misuse of statistics was unethical (Altman, 1982a), a view which has subsequently been widely endorsed but never challenged. The ethical implications of substandard research (not just statistical errors) are:

  1. 滥用患者,使其暴露于无正当理由的风险和不便;
     the misuse of patients by exposing them to unjustified risk and inconvenience;
  2. 浪费资源,包括研究人员的时间,这些时间本可以用于更有价值的活动;
     the misuse of resources, including the researchers' time, which could be better employed on more valuable activities; and
  3. 发表误导性结果的后果,可能导致进行不必要的进一步研究。
     the consequences of publishing misleading results, which may include the carrying out of unnecessary further work.

极端情况下,统计学的使用可能直接影响患者护理。特别是,有多个例子显示,一些基于无对照研究的有希望结果而广泛使用的治疗方法,后来通过随机试验被证明无效(见15.2.1节)。同样,流行病学研究中出现的矛盾结果可能与方法学差异有关。
In the extreme there may be a direct effect on patient care. In particular, there have been several examples of treatments that were widely used on the basis of promising results from uncontrolled studies, but were later shown by randomized trials to be ineffective (see section 15.2.1). Likewise, conflicting results from epidemiological studies may relate to methodological differences.

这些观点导致三个不可避免的结论。首先,研究者必须在研究的设计、执行、分析和解释上极其谨慎。其次,阅读和解读其他研究者的研究结果时也需谨慎,因为他们可能忽视了第一点。正如Albert(1981年)所言:“医生应具备的最重要技能之一是能够批判性地分析医学文献中的原创贡献。”第三,发表论文中统计学水平的高低,可能受医学期刊编辑和审稿政策的影响。
These remarks lead to three inevitable conclusions. First it behoves the researcher to take the greatest care in planning, executing, analysing and interpreting research. Second, care is needed too in reading and interpreting the research results from other investigators who may have disregarded the first point. As Albert (1981) has said: 'One of the most important skills a physician should have is the ability to critically analyse original contributions to the medical literature.' Third, the standard of statistics in published papers can be influenced by the editorial and refereeing policy of medical journals.

在本章最后,我将简要回顾统计学在医学研究中的发展,总结对已发表论文中统计质量的评估结果,探讨医学期刊在提升质量中的作用,并提供阅读和撰写科学论文的指导。
In this final chapter I shall briefly consider the growth of statistics within medical research, summarize the findings of reviews of the quality of statistics in published papers, consider the role of medical journals in improving the quality, and give guidance on reading and writing scientific papers.

16.2 医学研究中统计学的发展 16.2 THE GROWTH OF STATISTICS IN MEDICAL RESEARCH

无法确切界定统计方法何时引入医学研究,但除少数显著例外外,我们可将其起点视为本世纪前二十五年。1929年,一篇生理学期刊发表了大量关于统计分析和解释主要原则的论文(Dunn, 1929)。到1937年,临床研究中统计方法的正确使用已被认为重要,Lancet发表了Austin Bradford Hill撰写的15篇统计方法系列文章。这些文章很快以书籍形式再版;这本有影响力的书50年后仍在印刷(Hill, 1984),足以证明其价值。
It is not possible to pinpoint the introduction of statistical methods into medical research, but with a few notable exceptions we may look to the first quarter of this century. In 1929 a huge paper was published in a physiology journal expounding many of the main principles of statistical analysis and interpretation (Dunn, 1929). By 1937 the correct use of statistical methods in clinical research was considered important enough for the Lancet to publish a series of 15 articles on statistical methods by Austin Bradford Hill. These were quickly republished in book form; that this influential book remains in print 50 years later (Hill, 1984) pays tribute to its quality.

我们可以认为现代医学统计学的兴起始于1937年或1948年,即第一项著名随机临床试验报告发表之时。这是英国医学研究委员会关于肺结核链霉素的试验(Medical Research Council, 1948),Bradford Hill在其中起了关键作用。然而,统计学引入医学研究的进程总体较慢。1954年,英国医学杂志报道了皇家统计学会“医学统计研究小组”举行的一场辩论(Anon, 1954)。辩题是“本院应欢迎统计学在医学各分支中日益增长的影响力”。反方观点认为医学是艺术,统计学是科学,因此统计学不适合医学。更令人惊讶的是,考虑到辩论场合,该动议仅以微弱多数通过。
We might consider the modern rise of medical statistics to start either in 1937 or in 1948, when the report of the first well known randomized clinical trial was published. This was the Medical Research Council trial of streptomycin for pulmonary tuberculosis (Medical Research Council, 1948), in which Bradford Hill was a key influence. However, in general the introduction of statistics into medical research was slow. In 1954 the British Medical Journal reported a debate held by the Royal Statistical Society's 'Study Circle on Medical Statistics' (Anon, 1954). The motion was 'This house should welcome the growing influence of statistics in all branches of medicine'. The opposer of the motion made the remarkable observation that medicine was an art and statistics was a science, so statistics was out of place in medicine. More surprising, considering the forum, is that the motion was carried by only a narrow majority.

自1954年以来,医学统计学发展迅速,统计方法现已牢固植根于医学领域。通过对1952、1962、1972和1982年《儿科学》杂志发表论文的研究(Hayden, 1983),我们可以清楚看到统计学的增长。如表16.1所示,使用统计分析方法的论文比例大幅增加,且使用除简单 $t$ 检验、$\chi^2$ 检验和相关分析之外方法的论文增长了十倍。后者的变化在1972至1982年间尤为显著。另一项对《新英格兰医学杂志》1978-79年论文的研究显示,45%的论文仅使用简单统计方法(Colditz和Emerson, 1985)。还有一项研究对比了1967-68年与1982年《关节炎与风湿病》杂志中的统计分析(Felson等,1984)。如表16.2所示,两时期间统计分析显著增加,这至少部分归因于计算机的普及。含有统计错误的论文比例基本相同,但错误性质发生了较大变化。
Much has happened since 1954, and statistical methods are now firmly entrenched in medicine. A good idea of the growth of statistics is given by a study of papers published by the journal Pediatrics in the years 1952, 1962, 1972 and 1982 (Hayden, 1983). As Table 16.1 shows, there was a large increase in the proportion of papers using statistical methods of analysis and a tenfold rise in the use of methods beyond simple $t$ and $\chi^2$ tests and correlation. The change in the latter was especially marked between 1972 and 1982. A similar study of the New England Journal of Medicine showed that 45% of papers published in 1978-79 used only simple methods of statistical analysis (Colditz and Emerson, 1985). Another study contrasted the statistical analyses in papers published in Arthritis and Rheumatism in 1967-68 and 1982 (Felson et al., 1984). As Table 16.2 shows, they found some marked changes between the two periods. Papers published in 1982 contained many more statistical analyses, which may be at least partly due to the availability of computers. The proportion of papers containing statistical errors was much the same, but the nature of the errors had changed considerably.

表16.1 《儿科学》中统计程序的使用(Hayden, 1983)
Table 16.1 Use of statistical procedures in Pediatrics (Hayden, 1983)

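| 年份 | 1952 | 1962 | 1972 | 1982 |
|---|---|---|---|---|
| 论文数量 | 67 | 98 | 115 | 151 |
| 无统计程序 | 66% | 59% | 45% | 30% |
| 除t检验、χ²检验和相关分析外的统计程序 | 3% | 5% | 12% | 35% |

| Year | 1952 | 1962 | 1972 | 1982 |
|---|---|---|---|---|
| Number of papers | 67 | 98 | 115 | 151 |
| No statistical procedures | 66% | 59% | 45% | 30% |
| Statistical procedures other than t, χ² and r | 3% | 5% | 12% | 35% |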

多位作者对期刊进行了全面研究,以了解哪些统计方法最常用。表16.3展示了1978-79年《新英格兰医学杂志》论文中使用方法的降序排列,以及仅使用该表中方法的论文累计百分比。我们可以看到,列出的广泛技术涵盖了大多数但非全部发表论文。过去十年,高级统计技术的使用进一步增加,因此使用表16.3未列方法的论文比例可能有所上升。
Several authors have carried out a comprehensive study of journals to see which methods are in most frequent use. Table 16.3 shows the methods found in the review of the New England Journal of Medicine in 1978-79 in decreasing order of use, with the cumulative percentage of all papers that contained only methods that far down the table. We can see that the wide range of techniques listed covers most but not all published papers. The last ten years have seen a further increase in the use of more advanced statistical techniques, so that the proportion of papers which use methods not shown in Table 16.3 is likely to have increased.

表16.2 1967-68年与1982年《关节炎与风湿病》杂志中选定统计方法的使用情况及发现的错误数量(Felson等,1984)
Table 16.2 Use of selected statistical methods in Arthritis and Rheumatism in 1967-68 and 1982, and numbers of errors found (Felson et al., 1984)

| | 1967-68(n = 47) | 1982(n = 74) |
|---|---|---|
| **统计方法:** | | |
| t检验 | 8 (17%) | 37 (50%) |
| 卡方检验 | 9 (19%) | 22 (30%) |
| 线性回归 | 1 (2%) | 18 (24%) |
| 多重统计检验 | 4 (9%) | 30 (41%) |
| **错误:** | | |
| 方法未定义 | 14 (30%) | 7 (9%) |
| 位置或离散度指标描述不充分 | 6 (13%) | 7 (9%) |
| 重复观察被当作独立处理 | 1 (2%) | 4 (5%) |
| 两个组比较超过10个变量,显著性水平为5% | 3 (6%) | 28 (38%) |
| 多重t检验代替方差分析 | 2 (4%) | 18 (24%) |
| 预期频数过小却使用卡方检验 | 3 (6%) | 4 (5%) |
| 以上至少一种错误 | 28 (60%) | 49 (66%) |

| | 1967-68 (n = 47) | 1982 (n = 74) |
|---|---|---|
| **Statistical method:** | | |
| t test | 8 (17%) | 37 (50%) |
| Chi squared test | 9 (19%) | 22 (30%) |
| Linear regression | 1 (2%) | 18 (24%) |
| Multiple statistical tests | 4 (9%) | 30 (41%) |
| **Error:** | | |
| Undefined method | 14 (30%) | 7 (9%) |
| Inadequate description of measures of location or dispersion | 6 (13%) | 7 (9%) |
| Repeated observations treated as independent | 1 (2%) | 4 (5%) |
| Two groups compared on > 10 variables at 5% level | 3 (6%) | 28 (38%) |
| Multiple t tests instead of analysis of variance | 2 (4%) | 18 (24%) |
| Chi squared tests used when expected frequencies too small | 3 (6%) | 4 (5%) |
| At least one of above errors | 28 (60%) | 49 (66%) |

医学文献中统计方法种类极为多样。不幸的是,如下一节所示,已发表论文中统计信息的可靠性令人担忧。因此,批判性评估研究论文至关重要,而这需要熟悉广泛的统计概念和方法。除流行病学方法外,表16.3中列出的所有主题均包含在本书中。
There is thus an enormous diversity of statistical methodology in the medical literature. Unfortunately, as shown in the next section, the reliability of the statistical information in published papers is worryingly low. Thus it is essential to be able to assess critically research papers, to which end it is necessary to be familiar with a large range of statistical concepts and methods. Apart from epidemiological methods, all of the topics listed in Table 16.3 are included in this book.

除统计分析方法的变化外,研究设计类型也发生了同步变化。对1946-76年间《新英格兰医学杂志》发表研究设计的回顾显示,临床试验数量及受控试验比例增加,而队列研究减少,横断面研究增加(Fletcher和Fletcher,1979)。尽管临床试验备受关注,但它们仍仅占医学期刊发表论文的小部分,约为5%。
As well as changes in the methods of statistical analysis there has been a simultaneous change in the types of research design used. A review of the design of studies published in the New England Journal of Medicine over a similar period (1946-76) showed an increase in clinical trials, and in the proportion of trials that were controlled, but also a decrease in cohort studies in favour of cross-sectional studies (Fletcher and Fletcher, 1979). Much is written about clinical trials, but they still represent a small minority, perhaps only 5%, of all papers published in medical journals.

表16.3 1978-79年《新英格兰医学杂志》中最常用的统计技术(Emerson和Colditz,1983)
Table 16.3 The most common statistical techniques in the New England Journal of Medicine in 1978-79 (Emerson and Colditz, 1983)

| 技术 | 论文累计百分比 |
|---|---|
| 1 无统计方法或仅描述性方法 | 58 |
| 2 t检验 | 67 |
| 3 列联表(卡方检验) | 73 |
| 4 非参数检验 | 75 |
| 5 流行病学方法 | 77 |
| 6 皮尔逊相关(r) | 79 |
| 7 简单线性回归 | 82 |
| 8 方差分析 | 84 |
| 9 变换 | 86 |
| 10 秩相关 | 87 |
| 11 生命周期表分析 | 89 |
| 12 多元回归 | 90 |
| 13 多重比较 | 92 |

| Technique | Cumulative % of papers |
|---|---|
| 1 No statistical methods or descriptive methods only | 58 |
| 2 t tests | 67 |
| 3 Contingency tables (χ²) | 73 |
| 4 Non-parametric tests | 75 |
| 5 Epidemiological methods | 77 |
| 6 Pearson correlation (r) | 79 |
| 7 Simple linear regression | 82 |
| 8 Analysis of variance | 84 |
| 9 Transformations | 86 |
| 10 Rank correlation | 87 |
| 11 Life table analysis | 89 |
| 12 Multiple regression | 90 |
| 13 Multiple comparisons | 92 |

16.3 已发表论文中的统计学 16.3 STATISTICS IN PUBLISHED PAPERS

统计方法的误用自始即是问题。早在1932年,Greenwood评论过去20年变化时写道:“医学论文现在经常包含统计分析,有时这些分析是正确的,但作者同样频繁地违反统计或一般逻辑推理的基本原则”(Greenwood,1932)。1950年Hogben写道:“不到1%的研究者清楚理解他们常用统计技术的基本原理”(Hogben,1950)。更近的评论也表达了同样的信息:“几乎不可能阅读领先癌症期刊的某一期而不对研究设计、数据收集、反应定义、结果确定及报告产生严重质疑”(Hoogstraten,1984)。这些评述虽未基于系统性回顾,但自1960年代以来已有许多此类回顾。
It is clear that the misuse of statistical methods has been a problem from the outset. As early as 1932, commenting on changes over the preceding 20 years, Greenwood wrote: 'Medical papers now frequently contain statistical analyses, and sometimes these analyses are correct, but the writers violate quite as often as before, the fundamental principles of statistical or of general logical reasoning' (Greenwood, 1932). In 1950 Hogben wrote 'Less than 1 per cent of research workers clearly apprehend the rationale of statistical techniques they commonly invoke' (Hogben, 1950). A much more recent comment contains the same message: 'It is nearly impossible to read an issue of leading cancer journals without giving rise to serious questions about study design, data collection, definitions of response, determination of results, and the reporting of results' (Hoogstraten, 1984). These assessments were not supported by systematic reviews of the content of published papers, but since the 1960s there have been many such reviews.

16.3.1 文献综述 16.3.1 Reviews of the literature

我所知关于医学期刊中统计质量的最早评论是 Dunn(1929年)提出的,他观察到所检阅的一系列发表论文中有一半在统计学上不可接受。
The earliest comment I know of relating to the quality of statistics in medical journals is that by Dunn (1929), who observed that half of a series of published papers examined were not acceptable statistically.

最早的现代综述之一是 Schor 和 Karten(1966年)对十种医学期刊中发表的295篇论文的审查。他们认为28%的论文在统计学上是可接受的,68%存在缺陷,5%“无法挽救”。随后对众多不同综合及专科期刊发表论文的多次综述显示了大致相似的情况。由于评审者使用的标准差异较大,这些研究难以总结,但通常发现约半数论文至少存在一处统计错误。错误的重要性也难以评估。许多小错误对研究总体结论无实质影响,但部分错误可能导致严重的解释偏差。
One of the first modern reviews was by Schor and Karten (1966), who examined 295 papers published in ten medical journals. They considered that 28% of the papers were statistically acceptable, 68% were deficient, and 5% were 'unsalvageable'. The many subsequent reviews of papers published in numerous different general and specialist journals have found a broadly similar picture. It is difficult to summarize these studies because of the wide range of criteria used by the reviewers, but they have typically found that about half of the papers examined included at least one statistical error. It is also hard to say how important these errors are. Certainly many minor errors will have no material bearing on the overall conclusions of a study, but some may lead to major errors of interpretation.

大多数综述关注统计分析错误,但也有一些关注设计错误,尤其是临床试验设计。例如,Tyson 等(1983年)按照预设标准,回顾了四种期刊中发表的86项围产期医学治疗试验报告。其结果总结于表16.4,显示所检论文存在重大缺陷。部分信息缺失可能因报告不充分而非设计不当,但阅读论文时不能假设未说明的内容。例如,若临床试验报告提及采用随机分配但未提供关于所用程序的更多细节,我们不能假设他们真的进行了随机分配。许多研究者不理解“随机”的含义。同样,当统计方法未明确说明时,我们也不能假设其方法适当。这也是为何评审者在近四分之三的论文中无法判断结论是否合理。
Most reviewers have looked at errors in statistical analysis, but some have looked at errors in design, especially for clinical trials. For example, Tyson et al. (1983) reviewed reports of 86 therapeutic trials in perinatal medicine published in four journals, using predetermined criteria. Their results are summarized in Table 16.4, and show major deficiencies in the papers examined. Some of the missing information may be due to poor reporting rather than bad design, but when reading a paper we cannot assume things that are not stated. For example, if a report of a clinical trial mentions that random allocation was used but offers no further information about the procedure used, we cannot assume that they really did randomize. Many researchers do not understand what 'random' means. Likewise, we cannot assume that the statistical methods were appropriate when, as is often the case, the methods are not identified. This is why the reviewers felt unable to judge whether the conclusions were justified in nearly three-quarters of the papers examined.

表16.4 围产期医学中86项治疗试验综述摘要(Tyson 等,1983年)
Table 16.4 Summary of review of 86 therapeutic trials in perinatal medicine (Tyson et al., 1983)

| 标准(符合标准的研究比例,%) | 是 | 不明确 | 否 |
|---|---|---|---|
| 目的陈述 | 94 | 6 | 0 |
| 明确的终点变量定义 | 74 | 1 | 25 |
| 计划性前瞻性数据收集 | 48 | 30 | 22 |
| 预定样本量(或顺序试验) | 31 | 67 | 1 |
| 样本量说明 | 93 | 6 | 1 |
| 受试者疾病/健康状态说明(n=85) | 51 | 20 | 29 |
| 排除标准说明(n=81) | 46 | 9 | 45 |
| 随机化(如可行)适当执行并记录(n=69) | 9 | 12 | 79 |
| 采用盲法,或无盲法对结果无偏倚可能(n=83) | 49 | 47 | 4 |
| 样本量充足 | 15 | 44 | 41 |
| 统计方法明确,适当使用及解释 | 26 | 0 | 74 |
| 建议/结论合理 | 10 | 71 | 19 |

| Criterion (% of studies fulfilling criteria) | Yes | Unclear | No |
|---|---|---|---|
| Statement of purpose | 94 | 6 | 0 |
| Clearly defined outcome variables | 74 | 1 | 25 |
| Planned prospective data collection | 48 | 30 | 22 |
| Predetermined sample size (or a sequential trial) | 31 | 67 | 1 |
| Sample size specified | 93 | 6 | 1 |
| Disease/health status of subjects specified (n = 85) | 51 | 20 | 29 |
| Exclusion criteria specified (n = 81) | 46 | 9 | 45 |
| Randomization (if feasible) appropriately performed and documented (n = 69) | 9 | 12 | 79 |
| Blinding used, or lack of blinding unlikely to have biased results (n = 83) | 49 | 47 | 4 |
| Adequate sample size | 15 | 44 | 41 |
| Statistical methods identified, appropriately used and interpreted | 26 | 0 | 74 |
| Recommendations/conclusions justified | 10 | 71 | 19 |

最近对约150项此类研究进行了综合回顾(Johnson 和 Altman,1990;Altman 和 Johnson,1990)。结果显示错误发生频率并未随时间减少,尽管近期评审者对错误的认定更为严格。
A comprehensive review of about 150 such studies has been carried out recently (Johnson and Altman, 1990; Altman and Johnson, 1990). It provides little evidence that the frequency of errors is diminishing over time, although it is likely that more recent reviewers have taken a harder line over what they considered to be errors.

统计错误可能发生在研究的任何阶段:规划、设计、执行、分析、呈现和解释。研究规划阶段,若盲目接受其他已发表论文的结果,可能导致设计或样本量判断错误。然而,从设计到解释的其他阶段更明显容易出错,下面将逐一讨论。所举例子并非详尽无遗。
Statistical errors can occur at any stage of a study: planning, design, execution, analysis, presentation and interpretation. When planning a study, it is possible to make incorrect judgements about the design or sample size if the findings of other published papers are accepted uncritically. However, the other stages of research, from design through to interpretation, are more obvious places where things can go wrong, and I shall consider each in turn. The examples given are by no means comprehensive.

16.3.2 设计中的错误 16.3.2 Errors in design

对临床试验的回顾显示,在试验设计和执行相关的重要信息报告方面存在重大缺陷。或许更令人担忧的是,这些回顾还显示,相当比例的论文报告了采用次优设计的研究。例如,尽管大量文献敦促临床试验采用尽可能高的设计标准,研究仍常常在无同期对照或同期但非随机对照的情况下进行,并且在本可采用盲法时却未使用。一个主要的担忧是,设计较差的研究容易产生偏倚,特别是可能产生过于乐观的结果。Fletcher 和 Fletcher(1979)举了几个例子,说明基于薄弱研究设计的结论后来被设计良好的研究所纠正。所有比较同一治疗的设计良好与设计较差的临床试验结果的回顾都发现,后者得到的治疗效果更大(Altman 和 Johnson,1990)。如果较弱的研究通常规模也较小,那么发表偏倚的影响(见第15.5.2节)可能更为严重。
Reviews of clinical trials have shown major deficiencies in the reporting of vital information relating to the design and execution of the trial. Perhaps more worryingly, they have also shown that a fair proportion of papers report studies that have used suboptimal design. For example, despite a huge literature urging the highest possible standards of design for clinical trials, studies are still commonly carried out without concurrent controls, or with concurrent but non-randomized controls, and blinding is not used when it could have been. A major worry is that studies with inferior designs are open to bias, and in particular may produce over-optimistic findings. Fletcher and Fletcher (1979) give several examples where conclusions based on weak research designs were later corrected by subsequent well designed studies. All reviews comparing the results of well designed and poorly designed clinical trials of the same treatments have found that the latter obtained larger treatment effects (Altman and Johnson, 1990). If, as is likely, the weaker studies are also smaller, then the effect of publication bias (see section 15.5.2) may be more severe.

这些问题并不限于临床试验。对诊断测试评估研究的回顾同样显示设计和报告存在重大缺陷(Sheps 和 Schechter,1984)。然而,在临床试验领域之外,更难对主要错误做出普遍性描述;第5章给出了几个潜在困难的例子。
The problems are not confined to clinical trials. Reviews of studies evaluating diagnostic tests have similarly shown major deficiencies in design and reporting (Sheps and Schechter, 1984). However, outside the field of clinical trials it is harder to make general statements about the main errors that are made; Chapter 5 gave several examples of potential difficulties.

一些问题的原因在于许多研究实际上并非设计出来,而是“偶发”的。这些研究基于为其他目的收集的既有数据进行分析。尽管许多此类研究报告承认研究是回顾性的,但有些则假装研究是前瞻性的,因此是计划好的,因为这样看起来更好,暗示想法先于数据。无设计研究的表现包括所用治疗和评价方法的变异,不同受试者观察次数不等,许多缺失观察,以及对所做内容和原因的模糊描述。
One reason for some of the problems is that many studies are not actually designed but rather 'happen'. They are based on an analysis of pre-existing data that were collected for some other purpose. While many reports of such studies admit that the study was retrospective, some pretend that the study was prospective, and thus planned, as it looks better to suggest that the idea came before the data. Symptoms of undesigned studies are variation in the treatments and methods of evaluation used, unequal numbers of observations for different subjects, many missing observations, and a general vagueness about what was done and why.

现有数据使用的一个例子是许多利用胎儿超声测量建立参考标准的研究。几乎所有已发表的研究都基于既有数据分析,因此每个胎儿的观察次数不同,测量时间也非预先指定。虽然早孕期进行一次常规超声检查很常见,但除非有临床关注原因,通常不会进行进一步的超声测量。因此,在这些数据集中多次出现的胎儿很可能是非典型的,平均体型可能不同,导致数据并非其所声称的那样。Green 和 Byar(1984)讨论了从登记数据而非为特定目的收集的数据分析可能产生的一些问题。
An example of the use of existing data is seen in many studies using fetal ultrasound measurements to develop reference standards. Virtually all published studies are based on the analysis of existing data, so that the number of observations per fetus varies and the measurements are not taken at pre-specified times. While a single routine ultrasound examination in early pregnancy is common, further ultrasound measurements are not usual unless there is some cause for clinical concern. Thus those fetuses that are represented several times in these data sets are likely to be atypical and quite possibly of a different size on average, so that the data are not what they purport to be. Green and Byar (1984) have discussed some of the problems that can arise from the analysis of data from registries rather than data collected for the purpose in hand.

其他设计问题在第5章中提及。例子包括:选择不适当的高风险样本来推断一般人群;病例对照研究中选择不适当的对照;以及当人们可以选择治疗时产生的志愿者偏倚。另一个潜在问题是“健康工人效应”,即就业者比一般人群更健康;在研究工业暴露某些危害的可能不良影响时需考虑此效应。还有一个问题是,在比较两种替代测量方法的研究中使用不同观察者(见第14.2节)。如果每种方法仅由一名观察者使用,则观察者之间的系统性差异与方法差异不可分割(或称“混杂”)。
Other design problems were referred to in Chapter 5. Examples are the choice of an inappropriate high risk sample to make inferences about the general population; choice of inappropriate controls in a case-control study; and the volunteer bias that arises when people can choose their treatment. Another example of a potential problem is the 'healthy worker effect', whereby people in employment are healthier than the general population; this needs to be considered in studies of possible adverse effects of industrial exposure to some hazard. Yet another is the use of different observers in a study to compare two alternative methods of measurement (see section 14.2). If each method is used by only one observer there is an inseparability (or 'confounding') of any systematic differences between the observers with any difference between the methods.

这些例子仅用于说明可能出现的各种陷阱。冒着重复的风险,我再次强调,寻求统计专家建议的最佳时机是在设计研究时,这样才能及时发现并纠正此类缺陷。
These few examples serve only to illustrate the wide variety of possible pitfalls. At the risk of repetition, I shall say again that the best time to seek expert statistical advice is when you are planning a study, so that any flaws of this sort can be spotted and rectified.

最后,存在样本量不足的根本问题。如第8.5.4节所述,Freiman等人(1978年)的一项综述显示,许多发表的临床试验因样本量过小,导致无法检测出显著的治疗效果,从而得出治疗间无显著差异的结论。很少有已发表的研究报告其样本量是基于统计效能计算确定的。事实上,样本量计算的概念在临床试验领域之外的医学研究中几乎是未知的,尽管同样的方法同样适用于所有比较研究,并可用于规划任何调查。
Lastly, there is the fundamental problem of having an inadequate sample size. As noted in section 8.5.4, a review by Freiman et al. (1978) showed that many published clinical trials that find a non-significant difference between treatments had little chance of detecting major treatment effects due to small sample sizes. Few published studies report that the sample size was chosen on the basis of power calculations. Indeed, the concept of sample size calculations seems almost unknown in medical research outside the field of clinical trials, although the same methods are equally applicable to all comparative studies and can be used in planning any investigation.

例如,最近有一篇论文讨论了流变学研究的样本量计算(Stuart等,1989年)。
For example, a recent paper has discussed sample size calculations for rheological studies (Stuart et al., 1989).
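
As a rough illustration of what such a calculation involves, the sketch below uses the widely taught Normal-approximation formula for comparing two means, n = 2(z_α + z_β)²(SD/δ)² subjects per group. The function name and the example figures are hypothetical, chosen only for illustration; they are not taken from any of the studies cited here, and a real study should use a method suited to its own design.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate number of subjects per group needed to detect a true
    difference `delta` between two means, given a common standard
    deviation `sd` (Normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    n = 2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2
    return math.ceil(n)                            # round up to whole subjects


# Hypothetical planning figures: detect a difference of half a standard deviation.
n_needed = sample_size_per_group(delta=5, sd=10)
```

Halving the difference to be detected roughly quadruples the required sample size, which is one reason why small studies so often have little chance of detecting realistic effects.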

即使使用了效能计算来确定样本量,实际招募的受试者数量可能仍不及预期。在临床试验中,实际招募率常常远低于预期,部分原因是对符合条件受试者数量的高估,部分原因是受试者不愿意参与试验。
Even when power calculations have been used to calculate sample size, the supply of subjects may not be as great as anticipated. It is common in clinical trials for the actual recruitment rate to fall far short of that anticipated, partly because of overestimation of the number of eligible subjects and partly because of their unwillingness to enter the trial.

虽然 P 值可能掩盖研究样本量过小的事实,但非常宽泛的置信区间则表明缺乏有用信息;这也是支持使用置信区间的理由之一(见第8.8节)。
Whereas P values can disguise the fact that a study was too small, a very wide confidence interval indicates the lack of useful information; this is one of the arguments in favour of the use of confidence intervals (see section 8.8).

16.3.3 执行中的错误 16.3.3 Errors in execution

尤其是在前瞻性研究中,数据收集可能无法按计划进行。换句话说,研究方案或计划未被严格遵守。第15章讨论了临床试验中可能出现的各种问题,特别是关于正确排除不符合条件的受试者以及确保每位患者接受分配给他们的治疗。有人可能认为基于奇偶数的简单分配方案错误率较低,但事实并非如此。在一项使用手术日期的奇偶数和出生年份的奇偶数将受试者分配到三种不同心脏瓣膜假体的研究报告中,作者写道:“我们发现……随机分配程序在整个研究期间并不完全一致,部分原因是供应困难,部分原因是奇偶数标准有时被误解”(Kuntze 等,1989)。未提供更多细节。提及供应困难暗示可能存在某些时候三种装置并非全部可用,但我们不知道他们如何处理这一问题。奇偶日期的问题表明,这种看似简单的方案实际上可能比使用预先准备的信封进行随机分配更难正确操作,此外该设计在其他方面也较差。(当然,他们所用的分配系统并非真正的随机分配,尽管他们错误地声称如此。)
In prospective studies in particular, the collection of data may not go according to plan. Another way of expressing this idea is that the study plan or protocol is not strictly adhered to. Various problems that can occur in clinical trials were discussed in Chapter 15, notably with regard to the correct exclusion of ineligible subjects and ensuring that each patient received the treatment that was allocated to them. It might be thought that simple allocation schemes, such as those based on odd or even numbers, would be less prone to error, but this is not so. In the report of a study that used both odd or even date of operation and odd or even year of birth to allocate subjects to three different types of heart valve prosthesis, the authors wrote: 'We found … that the randomisation procedure was not entirely consistent throughout the study period, partly because of supply difficulties and partly because the odd/even criteria were sometimes misunderstood' (Kuntze et al., 1989). No further details were given. The mention of supply difficulties suggests that there may have been times when not all three devices were available, but we are not told how this was dealt with. The problems with odd and even dates suggest that this apparently simple scheme may indeed be harder to operate correctly than, for example, randomization using prepared envelopes, apart from this being an inferior design for other reasons. (Of course, the allocation system they used was not randomization, as they wrongly claimed.)

数据缺失可能是数据收集系统失败的结果,例如忽视了周末或假期期间的情况。如果数据直到研究结束才被检查,发现问题时往往已经无法纠正。
Missing data may be the consequence of a failure in the data collection system, for example arising from neglecting to consider what would happen during weekends or holidays. If the data are not examined until the end of the study, by the time that any problems are spotted it will be too late to rectify them.

16.3.4 分析中的错误 16.3.4 Errors in analysis

分析中的错误遗憾地很常见。在本书前几章中,我曾警告过不当使用所介绍方法的情况。这些警告基于对医学期刊中常见误用的了解。因此,以下基本错误经常出现:
Errors in analysis are regrettably common. In earlier chapters of this book I have warned against improper uses of the methods introduced. These warnings are based on knowledge that such misuses are common in medical journals. Thus the following basic errors are frequently made:

  1. 在假设不满足时使用分析方法; using methods of analysis when the assumptions are not met;
  2. 分析配对数据时忽略配对关系; analysing paired data ignoring the pairing;
  3. 未考虑有序类别; failing to take account of ordered categories;
  4. 将同一受试者的多次观察视为独立; treating multiple observations on one subject as independent;
  5. 使用多重配对比较代替考虑所有组的整体分析(如方差分析); using multiple paired comparisons instead of an analysis that considers all groups (e.g. analysis of variance);
  6. 在组内进行分析后,再通过比较 P 值或置信区间来比较各组; performing within group analyses and then comparing groups by comparing P values or confidence intervals;
  7. 报告包含不可能值的置信区间。 quoting confidence intervals that include impossible values.

我将这些错误称为“基本错误”,因为它们反映了对基本统计概念的缺乏理解,实在难以原谅。
I have described these errors as 'basic' because they demonstrate a lack of understanding of fundamental statistical concepts. They are not really excusable.
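
To make the second of these errors concrete, the following sketch computes the t statistic for the same before-and-after measurements twice: once respecting the pairing and once wrongly treating the two sets as independent groups. The data are invented for illustration and the function names are hypothetical; only the Python standard library is used.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical measurements on the same six subjects before and after treatment.
before = [120, 135, 150, 110, 160, 145]
after = [116, 130, 147, 105, 156, 141]


def paired_t(x, y):
    """t statistic based on within-subject differences (correct for paired data)."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def unpaired_t(x, y):
    """Two-sample t statistic with pooled variance (ignores the pairing)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))


t_paired = paired_t(before, after)
t_unpaired = unpaired_t(before, after)
```

With these invented figures the between-subject variation swamps the small but consistent within-subject change, so the unpaired statistic is close to zero while the paired statistic is large. Ignoring the pairing discards exactly the information the design was meant to provide.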

然而,还有其他错误,可能同样甚至更为严重,但这些错误更多是逻辑上的,而非技术上的。一些例子,均已在前面章节讨论过,包括:
There are other errors, however, which may be equally or even more serious but where the error is more one of logic than technique. Some examples, all discussed in earlier chapters, are:

  1. 在方法比较研究中使用相关性; using correlation in method comparison studies;
  2. 使用相关性比较两组时间相关的观测值; using correlation to compare two sets of time-related observations;
  3. 通过假设检验评估两个或多个组的可比性; assessing the comparability of two or more groups by means of hypothesis tests;
  4. 仅通过敏感性和特异性评估诊断试验。 evaluating a diagnostic test solely by means of sensitivity and specificity.

这些错误的出现或许更能被理解,因为它们较为微妙,尽管医学期刊中已有多次论述。期刊在审稿时未能发现这些错误则几乎没有理由。
There is perhaps rather more excuse for these errors being made as they are rather more subtle, although the errors have been written about many times in medical journals. There is little excuse for journals not detecting them when papers are submitted.

另一种不可接受的做法是基于数据的一个子集得出结论。除了多次分析多个子集或亚组会导致极大概率出现显著结果(即 P < 0.05)外,将子集分析作为主要发现可能会扭曲整体情况。一个例子是对乳腺癌患者两种化疗方案的临床试验(Lippman 等,1984)。两组疾病进展时间的总体比较采用了log-rank检验,结果无显著差异。然而,作者还比较了仅对治疗有反应的患者的进展时间,发现组间存在显著差异。论文摘要中仅出现了后者分析,且未提及总体无显著差异。无论是治疗反应比例还是进展时间(或生存时间)都可能被视为本研究的合适终点,但比较必须基于所有患者,而非选定的子集。
Another unacceptable practice is to base conclusions on a subset of the data. Apart from the fact that investigation of many subsets or subgroups will lead to a high probability that something will turn up (i.e. yield P < 0.05), presentation of a subset analysis as the main finding may distort the picture. An example is given in a clinical trial comparing two chemotherapy regimes in breast cancer patients (Lippman et al., 1984). The overall comparison of time to progression of disease in the two groups was performed by the logrank test and showed no significant difference. However, the authors also compared the time to progression among only those patients who responded to treatment, for whom there was a significant difference between the groups. Only the latter analysis appears in the summary of the paper, with no mention that there was no overall significant difference. Either the proportion responding to treatment or time to progression (or survival time) might be considered a suitable end-point for this study, but the comparison must be based on all patients, not a selected subset.
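
The way in which examining many subgroups inflates the chance that 'something will turn up' can be sketched with a line of arithmetic. The function below is a hypothetical helper, and it assumes that the tests are independent and that there is truly no effect in any subgroup, which real subgroup analyses only approximate.

```python
def prob_any_significant(n_tests, alpha=0.05):
    """Probability that at least one of `n_tests` independent tests gives
    P < alpha when every null hypothesis is in fact true."""
    return 1 - (1 - alpha) ** n_tests


# Chance of at least one spurious 'significant' result for 1, 5, 10 and 20 tests.
chances = {k: round(prob_any_significant(k), 2) for k in (1, 5, 10, 20)}
```

Under these assumptions ten subgroup comparisons already carry about a 40% chance of at least one nominally significant result.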

16.3.5 结果展示中的错误 16.3.5 Errors in presentation

在结果展示方面,医学期刊中也普遍存在几种常见错误:
With presentation of results too there are several common errors that abound in medical journals:

  1. 使用标准误(或置信区间)作为描述性信息; using standard errors (or confidence intervals) for descriptive information;
  2. 仅呈现连续数据的均值(或中位数)而未提供变异性指标; presenting means (or medians) of continuous data without any indication of variability;
  3. 仅以 P 值来呈现统计分析结果。 presenting the results of a statistical analysis solely as a P value.

除最后一点外,这些问题同样适用于数值和图形展示。
All but the last of these problems relate equally to numerical and graphical presentation.

(a) 数值精度 (a) Numerical precision

结果展示中一个常见的不足是数值精度的使用。虚假的精度无助于论文,反而降低其可读性和可信度。虽然难以制定绝对规则,但以下指导或许有帮助。展示汇总统计量或分析结果(如均值、标准差和回归方程)时,应考虑原始数据的精度。均值通常不应比原始数据多保留一位小数,但标准误和标准差可能需要多保留一位小数。百分比最多保留一位小数即可,尤其是在样本量较小时。如果同时给出分子和分母(通常应如此),则百分比可四舍五入至整数。检验统计量不必超过两位小数。同样,P 值保留一到两位有效数字即可,且不必精确到0.0001以下(详见第8.10节)。其他具体建议见前几章。
One aspect of presentation that is often poor is the numerical precision used to present data and results. Spurious precision adds nothing to a paper and impairs its readability and credibility. It is hard to provide absolute rules, but the following guidelines may help. When presenting summary statistics or the results of analyses, such as means, standard deviations and regression equations, the precision of the original data should be borne in mind. Means should not usually be quoted to more than one further decimal place than the raw data, but standard errors and standard deviations may require one extra decimal place. Percentages do not need to be given to more than one decimal place at most, especially in small samples. If the numerator and denominator are given, as should usually be the case, then there is no reason not to quote percentages to the nearest integer. Test statistics do not need to be given to more than two decimal places. Likewise P values do not need more than one or two significant digits, and it is not necessary to be specific below, say, 0.0001 (see section 8.10). Other specific advice is given in several earlier chapters.

以下是一些不必要(或虚假)精度的例子,均摘自已发表论文:
Some examples of unnecessary (or spurious) precision, all from published papers, are:

一个来自回归分析的例子是以下方程,描述了出生体重(单位:千克,BWt)与胸围(CC)及上臂中围(AC)(均为厘米)之间的关系(Bhargava 等,1985年)
An example from a regression analysis is the following equation relating birth weight in kg (BWt) to chest circumference (CC) and mid-arm circumference (AC) (both in cm) (Bhargava et al., 1985)

该方程声称可以预测出生体重精确到 !许多此类例子可能源自计算机输出的精确转录。
which purports to predict birthweight to the nearest Many such examples may arise from exact transcription from computer output.

最后,文献综述中常见的一个问题是均值后使用 ± 符号,但未说明符号后数字是标准差还是标准误。± 的用法已存在数十年,但其歧义性极大,以至于包括《英国医学杂志》和《柳叶刀》在内的多家医学期刊现已禁止使用,尽管大多数期刊仍允许。因此,例如,平均血压后的“± 7.1”应改为“(标准差 7.1)”(或如果是标准误则注明)。首选的标注方式明确无歧义,也避免了标准差(或标准误)可以为正或负的错误暗示,以及平均值 ± 标准差(或标准误)范围特别重要的误解。
Lastly, a common problem found in many reviews of the literature is the use of the ± sign after a mean without specifying whether the number after the sign is the standard deviation or standard error. The ± usage has been around for many decades, but the ambiguity is so serious that several medical journals, including the British Medical Journal and Lancet, do not now allow its use, although most journals still do. Thus, for example, a mean blood pressure followed by '± 7.1' would instead be followed by '(SD 7.1)' (or SE if that were the case). The preferred notation is unambiguous and also avoids the incorrect implications that the standard deviation (or standard error) can be positive or negative and that the range given by mean ± SD (or mean ± SE) is of especial interest.
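
The guidelines on precision and on labelling the measure of spread can be combined in a small helper, sketched below. The function name, its arguments and the three data values are hypothetical; the rule of one more decimal place than the raw data is simply the guideline stated above, applied mechanically.

```python
import math
from statistics import mean, stdev


def summarise(values, raw_decimals=0, use_se=False):
    """Format a mean with an explicitly labelled SD (or SE), quoted to one
    more decimal place than the raw data."""
    sd = stdev(values)
    spread = sd / math.sqrt(len(values)) if use_se else sd
    label = "SE" if use_se else "SD"
    places = raw_decimals + 1
    return f"{mean(values):.{places}f} ({label} {spread:.{places}f})"


# Hypothetical readings recorded to the nearest whole unit.
readings = [70, 80, 90]
as_sd = summarise(readings)               # '80.0 (SD 10.0)'
as_se = summarise(readings, use_se=True)  # '80.0 (SE 5.8)'
```

The same three readings give quite different numbers after the mean depending on whether the SD or the SE is quoted, which is why an unlabelled ± is so unhelpful.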

(b) 图形展示 (b) Graphical presentation

虽然有些结果只能用表格或图形展示,但在两者均可用的情况下,选择哪种方式更优存在很大不确定性(期刊通常不允许作者同时用两种方式展示同一数据)。我认为无法给出简单的通用指导。表格能更准确地呈现结果,但许多人更喜欢图形,因为图形更易于理解数据的含义。图形在展示个体受试者数据时最具优势,例如散点图或时间趋势图,而非仅展示汇总信息。
Although some results can be displayed only as a table or only as a graph, in cases where either is possible there is much uncertainty about which is preferable (journals will not allow the author to display the same data both ways). Again, I do not think it is possible to give simple general guidance. Results are given more accurately in tables, but many people find graphs preferable for seeing the message of the data. Graphs are most advantageous when they show data for individual subjects, for example as scatter diagrams or time trends, rather than summary information.

这里无法详尽讨论图形展示的注意事项。现已有多本专著涉及该主题,其中 Tufte(1983年)的著作尤为值得一读。在本节内容中,值得考虑图形可能产生误导的几种方式。
There is not room here for a comprehensive discussion of the dos and don'ts of graphical presentation. There are now several books devoted to the topic, of which that by Tufte (1983) is particularly worth reading. In the context of this section, however, it is worth considering some ways in which graphs can be misleading.

散点图是一种特别有价值的图形类型,能够赋予相关性或回归分析以实际意义。同样,显示多个受试者组内所有观测值的图形远优于仅显示汇总统计的图形。图 3.14 和图 9.6 展示了同一组数据的两种风格,形成鲜明对比。仅显示均值和标准误等汇总统计的图形(如图 9.6)很少值得占用版面。
Scatter diagrams are a particularly valuable type of graph, and give meaning to a correlation or regression analysis. Likewise graphs that show all observations within several groups of subjects are far preferable to those showing just summary statistics. The contrast is illustrated by Figures 3.14 and 9.6, which show the same set of data in both styles. Graphs showing only summary statistics, such as means and standard errors (Figure 9.6), are rarely worth the space they occupy.

图形中常见的误导性特征有:
Common misleading features of graphs are:

  1. 纵轴缺乏真正的零点; the lack of a true zero on the vertical axis;
  2. 坐标轴中间尺度的变化(在直方图中尤为恶劣); a change of scale in the middle of an axis (especially heinous in a histogram);
  3. 三维效果; three-dimensional effects;
  4. 散点图中未显示重合点; failure to show coincident points in a scatter diagram;
  5. 显示拟合的回归线但未展示原始数据的散点图; showing a fitted regression line without a scatter diagram of the raw data;
  6. 叠加两个(或更多)具有不同纵轴尺度的图形(尤其是它们的起点不为零时); superimposing two (or more) graphs with different vertical scales (especially when they do not start at zero);
  7. 绘制均值但未显示变异性指标。 plotting means without any indication of variability.

最后一个问题很常见,但加入标准差或标准误往往导致图形杂乱。如前所述,这类数据更适合用表格呈现。对于序列数据,有更好的方法,详见第14.6节。
The last of these problems is common, and yet the addition of standard deviations or standard errors inevitably leads to a cluttered graph. As noted, this type of data may be better in a table. In the case of serial data, there are better approaches, as outlined in section 14.6.

关于这些问题及更多内容的进一步讨论,可参见Tufte(1983)、Cleveland(1984)和Wainer(1984)。
Further discussion of these and many other issues can be found in Tufte (1983), Cleveland (1984) and Wainer (1984).

16.3.6 解释错误 16.3.6 Errors in interpretation

统计分析中大多数解释错误似乎与假设检验和 P 值有关。许多相关点已在前章讨论,但值得重申的是,P 值并非通常错误理解的“观察到的效应是偶然产生的概率”,而是指在原假设成立时,观察到该效应(或更极端效应)的概率。换言之,P 值衡量的是当总体无差异时,在样本中观察到此类效应的可能性。
It seems that the majority of errors in the interpretation of statistical analyses relate to hypothesis tests and P values. Many of these points have been covered in earlier chapters, but it is worth reiterating here that the P value is not, as is commonly wrongly stated, the probability that the observed effect is due to chance, but rather the probability of obtaining the observed effect (or a more unlikely one) when the null hypothesis is true. In other words, P assesses how likely it is to observe such an effect in a sample when there is no such difference in the population.
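
One way to see what this definition means in practice is a permutation test, sketched below with invented data. If the null hypothesis is true the group labels are arbitrary, so the labels are shuffled many times and we count how often a difference in means at least as large as the observed one arises. This is offered only as an illustration of the definition, not as the method used in any study discussed here; the function name, the number of shuffles and the seed are all assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """Estimate P as the proportion of random relabellings that give an
    absolute difference in means at least as large as the one observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical data: two clearly separated groups.
p = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
```

The result is a statement about how often such a difference would arise when there is no real difference. It is not the probability that the null hypothesis is true.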

另一种错误解释是认为,例如 P 值为0.001 比更大的 P 值代表更强的效应。虽然可能如此,但 P 值本身并不能证明这一点。
Another false interpretation is the belief that a P value of, say, 0.001 implies a stronger effect than a larger P value. While this may be so, the P values do not demonstrate it.
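One reason is that P depends on sample size as well as on the size of the effect. The sketch below uses invented numbers and a large-sample Normal approximation with a common, known standard deviation. Both the helper names and the data are assumptions made for illustration.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a standard Normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_for_mean_difference(diff: float, sd: float, n_per_group: int) -> float:
    """P value for a difference in means between two equal-sized groups
    (large-sample z approximation, common SD assumed known)."""
    se = sd * math.sqrt(2 / n_per_group)
    return two_sided_p_from_z(diff / se)


# Hypothetical studies: identical observed effect (5 units, SD 20),
# but very different numbers of subjects per group.
p_small_study = p_for_mean_difference(5.0, 20.0, 50)
p_large_study = p_for_mean_difference(5.0, 20.0, 500)
print(f"n = 50 per group:  P = {p_small_study:.4f}")
print(f"n = 500 per group: P = {p_large_study:.6f}")
```

The two hypothetical studies observe exactly the same difference, yet the larger one gives a far smaller P value. Comparing P values alone therefore says nothing reliable about which effect is stronger.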

对“显著”和“不显著” P 值的误解广泛存在。普遍认为研究目标是获得显著结果,因此非显著结果被视为研究失败。这种态度体现在将研究结果分别称为“阳性”和“阴性”,以及将后者糟糕地描述为“未达到统计显著性”。例如,将 P = … 的结果描述为“可能显著”就是这种扭曲的表现。统计显著性常被作为唯一的解释依据。因此,任何显著效应,无论多小或多不可信,都被视为真实;任何非显著效应则被视为“无差异”。如此使用统计学即放弃了对结果进行建设性思考。
Erroneous interpretations of 'significant' and 'not significant' P values abound. There is a common belief that the goal of research is a significant result, and consequently that a non-significant result implies that the research was unsuccessful. This attitude is seen in the frequent description of such study results as 'positive' and 'negative' respectively, and in the awful description of the latter as having 'failed to reach statistical significance'. An example of the contortion that this may lead to is the description of a result with P = … as 'probably significant'. Statistical significance is often used as the sole basis of the interpretation. Thus any significant effect, however small or implausible, is taken as real, and any non-significant effect is taken as indicating that there is 'no difference'. To use statistics in this way is to abdicate from any constructive thought about one's results.

置信区间的日益使用可能减少这些困难。多家顶级医学期刊已发表社论或文章支持使用置信区间,且部分期刊现要求作者为主要结果提供置信区间(Gardner和Altman,1989a)。在比较研究中,重要的是计算组间差异的置信区间,而非单独计算各组结果的置信区间。
The increasing use of confidence intervals may reduce the difficulties. Several leading medical journals have carried editorials or articles supporting the use of confidence intervals and some now expect authors to provide them for their main results (Gardner and Altman, 1989a). It is important in comparative studies that confidence intervals are calculated for the difference between groups, not for the results in each group separately.
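The last point can be illustrated with a short calculation. This is a sketch under stated assumptions: the counts (40/100 and 25/100) are hypothetical, and the intervals use the usual large-sample Normal approximation for proportions, which is only one of several possible methods.

```python
import math


def ci_proportion(r: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a single proportion r/n."""
    p = r / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


def ci_difference(r1: int, n1: int, r2: int, n2: int,
                  z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for the difference p1 - p2."""
    p1, p2 = r1 / n1, r2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# Hypothetical trial: 40/100 respond on treatment, 25/100 on control.
lo1, hi1 = ci_proportion(40, 100)
lo2, hi2 = ci_proportion(25, 100)
lo_d, hi_d = ci_difference(40, 100, 25, 100)
print(f"treatment:  {lo1:.3f} to {hi1:.3f}")
print(f"control:    {lo2:.3f} to {hi2:.3f}")
print(f"difference: {lo_d:.3f} to {hi_d:.3f}")
```

With these invented figures the two separate intervals overlap, which might wrongly suggest no clear difference, whereas the interval for the difference itself excludes zero. Only the latter addresses the comparison directly.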

另一种常见的解释错误是将关联等同于因果关系。如多章所述,观察到的关联并不必然意味着存在因果关系。通常只有在设计良好的随机对照试验中,我们才能较为安全地做出这种推断,在这种试验中,任何结果差异都可视为与治疗差异因果相关。否则,解释结果时需极为谨慎,且通常无法仅凭观察结果推断因果关系,必须结合其他类型的证据。当错误推断与观察到的关联本身可能是虚假的情况(如第11.3节所述)相结合时,错误的可能性极大。
The other frequent error in interpretation is to equate association and causation. As discussed in several chapters, an observed association does not necessarily imply that there is a causal relation. The only type of study where we can usually be safe in making such an inference is a well-conducted randomized controlled trial, where any difference in outcome may be taken as causally related to the difference in treatment. Otherwise great caution is needed in the interpretation of results, and causation cannot usually be inferred without other types of evidence. When the false inference is allied to a situation where the observed association itself may be spurious, as described in section 11.3, the scope for error is enormous.

最后,我必须回到第4.3节介绍的样本与总体的基本概念。大多数医学研究基于将样本的发现推断到感兴趣总体的原则。显然,这一过程关键依赖于样本是否具有总体代表性。理论上,样本应为随机抽取,但实际上几乎从不如此。因此,在实践中,我们需要某种方法来评估样本是否可被视为具有代表性,通常通过描述样本中受试者的特征,有时还将其与已知的总体特征进行比较。如果样本不具代表性,整个统计推断过程就会失败。这也是为什么高失访率或拒绝率会严重影响研究结果的原因。
Lastly, I must return to the basic idea of sample and population introduced in section 4.3. Most medical research is based on the principle of extrapolating findings from a sample to a population of interest. Clearly this exercise is crucially dependent upon the sample being representative of the population. In theory the sample should be a random one, but this is almost never the case. In practice, therefore, we need some way of assessing whether the sample may be considered representative, and this is usually done by means of describing the characteristics of the subjects in the sample, and sometimes comparing them with the known characteristics of the population. The whole process of statistical inference fails if the sample is not representative. This is why study results are heavily compromised by high dropout or refusal rates.

16.3.7 遗漏错误 16.3.7 Errors of omission

许多文献综述中都提到重要信息被遗漏的频率。如果未明确分析方法或设计的关键方面,我们不应假设所用程序是有效的。
Many reviews of the literature have included comments on how often important information was omitted. If the methods of analysis or key aspects of the design are not specified we should not assume that valid procedures were used.

Mosteller等人(1980)研究了132项癌症对照试验,发现仅46项(35%)明确说明了统计分析方法。在主要期刊发表的临床试验中,这一比例提高到了85%(DerSimonian等,1982),但仍有改进空间(见表16.2)。如果数据是配对的或来自有序类别,了解所用方法是否合适非常重要。如果数据明显不服从正态分布,我们需要确认所用分析方法是否恰当。在更基础层面,常常不清楚报告的是标准差还是标准误。仅有少数情况下可以根据论文中提供的信息重新分析数据以验证所用方法,但这本不应成为必要。
Mosteller et al. (1980) examined 132 controlled trials in cancer and found that the method of statistical analysis was specified in only 46 (35%). The much better figure of 85% for clinical trials published in major journals (DerSimonian et al., 1982) still leaves scope for improvement (see also Table 16.2). If the data are paired or come from ordered categories it is important to know whether the methods used were appropriate. If the data are clearly not Normally distributed, we need to be assured that the method of analysis was appropriate. At a more basic level it is often unclear whether standard deviations or standard errors are presented. It is only occasionally possible to reanalyse the data from the information given in a paper and so verify which method was used, but this should not be necessary.

虽然研究设计的大致结构通常能从报告中看出,但关键细节可能缺失。常常不清楚所有观察是否来自不同个体。配对组的方法可能含糊不清,且当“配对”组大小不一致时,我们应怀疑是否真正使用了配对。在随机试验中,随机化方法常被省略。DerSimonian等(1982)发现,在他们审查的67项试验中,只有19%报告了随机化方法。这里涉及两个方面:随机数序列的生成方法和治疗分配机制。很少有论文同时给出两者,而更重要的分配机制很少披露(Altman和Doré,1990)。仅凭方法部分或标题中出现“随机化”一词,我们无法确定试验是否真正采用了随机分配(Kuntze等,1989)。
While the broad structure of the design of a study is usually clear from the report, crucial details may be missing. It is often unclear if all the observations were taken from different individuals. Methods of matching groups may be vague, and indeed we must doubt if matching was used when, as is not uncommon, the 'matched' groups are not of the same size. In randomized trials, the method of randomization is frequently omitted. DerSimonian et al. (1982) found that the method of randomization was reported in only 19% of the 67 trials that they examined. There are two aspects here: the method of generating the random number sequence, and the mechanism for allocating the treatments. Few papers give both of them, and the more important mechanism is rarely given (Altman and Doré, 1990). We cannot be sure that a trial really did use random allocation simply from the use of the word 'randomized' somewhere in the methods section, or even in the title (Kuntze et al., 1989).

核查论文是否包含必要信息的一种方法是使用清单,相关示例见16.4节。
One way to see that a paper contains the necessary information is to use a checklist. Examples are discussed in section 16.4.

16.3.8 统计错误的后果 16.3.8 Consequences of statistical errors

文献综述通常报告约半数论文存在统计错误。然而,“错误”一词涵盖了各种偏离统计规范的情况,许多错误并不严重。较轻微的错误如将一些描述性信息仅以均值形式呈现而无变异性指标,通常不会引起过度担忧。对于未说明分析方法的情况,我们的反应可能取决于所用方法的“显而易见”程度。例如,2×2频数表几乎肯定使用卡方检验分析,但连续数据有多种分析选择,并非所有都适用于特定情况。
Reviews of the literature typically report that about half of the papers examined contained a statistical error. However, the term 'error' encompasses an enormous variety of deviations from statistical purity and many errors will not be serious. At the trivial end we may not worry unduly about the presentation of some descriptive information as means with no indication of variability. Our reaction to an analysis where the method is not stated may depend on how 'obvious' it is what was done. For example, two by two tables of frequencies will almost certainly have been analysed by a Chi squared test, but there are many options for continuous data, not all of which will be reasonable in any particular case.

评估约50%错误率的另一个问题是,关于何为错误尚无普遍共识。另一方面,诸如多重比较或仅选择性报告显著结果等问题,在综述中很少被考虑。
A further problem in assessing the typical error rate is that there is no general agreement on what constitutes an error. On the other hand, there are other problems that are rarely considered in reviews, such as multiple comparisons or selective reporting of only those results that are significant.

在16.1节中,我提出了统计错误的三个伦理影响:对患者的误用、资源的误用以及发表误导性结果的后果。后者可能以多种方式表现,具体取决于结果的性质:
In section 16.1 I gave three ethical implications of statistical errors: the misuse of patients, the misuse of resources and the consequences of publishing misleading results. The last of these can work in several ways, and will depend upon the nature of the results:

  1. 由于已有研究发现实验性治疗有效,尽管该研究存在缺陷,可能导致无法获得伦理委员会批准进行进一步研究;
  2. 其他科学家可能被引导去追随错误的研究方向;
  3. 未来患者可能接受较差的治疗,这既可能是研究结果的直接后果,也可能是有效治疗引入延迟的结果;
  4. 如果这些结果未被质疑,研究者可能在未来的研究中继续使用同样低劣的统计方法,其他人也可能效仿。

  1. it may prove impossible to get ethics committee approval to carry out further research because a published study has found the experimental treatment beneficial, even though that study was flawed;
  2. other scientists may be led to follow false lines of investigation;
  3. future patients may receive an inferior treatment, either as a direct consequence of the results of the study or possibly by the delay in the introduction of a truly effective treatment;
  4. if the results go unchallenged the researchers may use the same inferior statistical methods in future research, and others may copy them.

无论研究是否得出不恰当的结论,最后一点都是适用的。
The last of these applies whether or not the study reached inappropriate conclusions.

当然,如果所有人都能同样有效地发现错误,这些问题中的一些本可避免,但更理想的情况是期刊能够发现这些错误,从而不予发表或适当修改论文。不幸的是,几乎任何论文都能在某处发表,因此无论期刊如何加强统计审稿,问题仍将存在,批判性分析的需求也不会消失。
Some of these problems would, of course, be avoided if everyone was equally able to detect the errors, but it is much better if they can be detected by journals and the papers either not published or suitably amended. Unfortunately, almost any paper can get published somewhere, so however much journals continue to extend their statistical refereeing the problems will remain, as will the need for critical analysis.

16.3.9 为什么发表论文中存在如此多的统计错误? 16.3.9 Why are there so many statistical errors in published papers?

发表论文中的错误可归因于使用统计方法者统计知识不足,而这又源于统计教育的缺乏。本科阶段的统计教学虽能介绍一些关键统计概念,但对医学研究的需求准备不足。多项研究表明,医生对基本统计方法和理念的理解不足(Altman 和 Bland,1991)。如果简单的概念都未被很好理解,我们难以指望更复杂的方法表现更好。
Mistakes in published papers can be ascribed to inadequate understanding of statistics by those using the methods, which in turn is due to inadequate statistical education. Undergraduate teaching of statistics can introduce some of the key statistical concepts, but provides inadequate preparation for the requirements of medical research. Several studies have shown that the statistical understanding by doctors of basic statistical methods and ideas is inadequate (Altman and Bland, 1991). If simple ideas are not well understood, we can hardly expect more complex methods to fare better.

研究生课程更适合深入教授统计学,但很少有研究者参加过此类课程。Altman 和 Bland(1991)还考虑了统计广泛误用的其他原因,包括误导性的教科书和计算机程序的易得性。近年来计算机和统计软件的广泛普及使复杂分析方法得以广泛使用,但对这些技术的理解并未同步提升。
Postgraduate courses are more appropriate for in-depth teaching of statistics, but few researchers have attended such a course. Other reasons for the widespread misuse of statistics have been considered by Altman and Bland (1991), and include misleading textbooks and easy access to computer programs. The recent tremendous increase in the availability of computers and statistical software has given wide access to complex methods of analysis, but there has not been an accompanying increase in understanding of those techniques.

无论原因如何,单纯以无知作为辩护充其量也是值得怀疑的。统计设计和分析方法是严谨医学研究的基本组成部分,其使用需要与研究其他部分同等的技能。如果研究团队中缺乏这些技能,应通过寻求专家建议来获得。我们可以再次在1930年代找到有关这方面的良好建议,即布拉德福德·希尔(Bradford Hill)在《柳叶刀》首篇文章随附社论中写道:“考虑统计因素的时间是在研究计划阶段,而不是完成后”(匿名,1937)。遗憾的是,许多作者、编辑甚至伦理委员会仍未充分认识到正确统计在医学研究中的重要性。因此,阅读已发表的论文时应保持一定的审慎,即使是发表在最著名期刊上的文章也是如此。第16.4节提供了如何做到这一点的建议。
Whatever the reasons, it is at best questionable whether ignorance is an adequate defence. Statistical methods of design and analysis are an essential component of sound medical research, and their use requires certain skills no less than the other components of the research. If those skills are not present within the research team they should be acquired by seeking expert advice. We can again find sound advice in this respect in the 1930s, in the editorial accompanying the first of Bradford Hill's Lancet articles: 'The time to allow for statistical factors is when an inquiry is being planned, not when it is completed' (Anon, 1937). Sadly many authors, editors, and even ethics committees remain unconvinced about the importance of correct statistics in medical research. Thus it is necessary to read published papers with some circumspection, even those published in the most illustrious journals. Section 16.4 gives advice on how to do this.

16.3.10 医学期刊的角色 16.3.10 Role of the medical journals

显而易见,如果期刊停止发表存在重大错误的论文,发表论文中的统计水平将大大提高。尽管有证据表明统计审稿可以提升发表论文的质量,但期刊普遍在评判提交论文的统计部分方面行动迟缓。大多数期刊对作者的投稿指南仍然更关注参考文献的格式,而非所需的统计信息,且多数甚至未提及统计。无论喜欢与否,期刊有责任发表在科学上尽可能可靠的论文。医学文献的质量掌握在他们的编辑手中。
It is self-evident that the standard of statistics in published papers could be greatly raised if journals ceased publishing papers containing major errors. Despite evidence that statistical refereeing can improve the standard of published papers, journals have in general been slow to take steps to judge the statistical component of submitted papers. It is still true that most journals' instructions to authors give far more attention to how the references are laid out than to what statistical information is required, and most do not even mention statistics. Like it or not, journals have a responsibility for publishing papers that are, as far as can be judged, scientifically sound. The quality of the medical literature is in their editorial hands.

鉴于期刊编辑部成员中懂统计的通常不比论文作者多,统计专业知识必须通过专家审稿获得。在一项近期调查中,向98位医学期刊编辑询问其统计审稿政策,83人回复(George,1985)。仅有16%保证在发表前进行统计审查,但35%在编辑委员会中设有统计顾问或统计学家。显然,这方面有很大改进空间(Altman,1982c),尽管必须承认,可能没有足够的医学统计学家来审阅所有提交到医学期刊的论文。尽管如此,我认为所有期刊都应努力获得一定的统计意见。此外,他们应公布其统计审稿的编辑政策,而上述调查中仅有12%的期刊这样做了。
As few members of the editorial staff of journals are likely to know much more statistics than the authors of papers, the statistical expertise must be obtained through expert refereeing. In a recent survey, 98 editors of medical journals were asked about their policy on statistical refereeing and replies were obtained from 83 (George, 1985). Only 16% had a policy that guaranteed a statistical review prior to publication, but 35% had either a statistical consultant or a statistician on the editorial board. There is clearly considerable scope for improvement in this respect (Altman, 1982c), although it must be acknowledged that there are probably not enough medical statisticians to look at all papers submitted to medical journals. Nevertheless, I believe that all journals should endeavour to obtain some statistical input. Further, they should publish their editorial policy regarding statistical refereeing, something which only 12% of the journals in the above-mentioned survey had done.

短期内除少数期刊外,统计水平不太可能有大幅提升,因此作为读者应保持谨慎。切勿仅凭论文摘要就接受研究结果,而应仔细评估作者的方法。下一节将提供一些指导。
It is most unlikely that there will be a great improvement in the short term except perhaps in a few journals, so it is important to be a cautious reader of published papers. You should not accept research findings solely on the basis of the abstract of a paper, but rather should make a careful assessment of the authors' methods. Some guidance is given in the next section.

16.4 阅读科学论文 16.4 READING A SCIENTIFIC PAPER

阅读论文时,拥有一份具体的检查要点清单非常有帮助。正如Colton(1974,第317页)所指出的,不可能为每篇研究论文制定一套适用的问题。
It is enormously helpful when reading a paper to have a list of specific points to be looking out for. As noted by Colton (1974, p. 317), it is impossible to produce a set of questions that it would be appropriate to ask for every research paper.

部分解决方案是准备多份清单;例如,Gardner等(1989)为统计审稿人制作了两份清单,一份针对一般医学研究,另一份专门针对临床试验。其他作者提出了阅读临床试验报告的指南(Simon和Wittes,1985;Grant,1989;Reisch等,1989)、流行病学研究包括病例对照研究的指南(流行病学工作组,1981;Lichtenstein等,1987;Bracken,1989),以及诊断和筛查测试评估的指南(Sheps和Schechter,1984;Wald和Cuckle,1989)。
A partial solution is to have more than one; for example, Gardner et al. (1989) produced two checklists for statistical reviewers, one for general medical studies and one specifically for clinical trials. Other authors have proposed guidelines for reading reports on clinical trials (Simon and Wittes, 1985; Grant, 1989; Reisch et al., 1989), epidemiological studies including case-control studies (Epidemiology Work Group, 1981; Lichtenstein et al., 1987; Bracken, 1989), and evaluation of diagnostic and screening tests (Sheps and Schechter, 1984; Wald and Cuckle, 1989).

我将采用两份不同清单的思路,分别针对一般研究和临床试验。图16.1和16.2所示的清单主要基于Gardner等(1989)的工作,但加入了一些扩展和澄清。它们设计用于协助英国医学杂志的统计审稿,但同样适用于评估已发表的论文。
I shall follow the idea of having two different checklists for general studies and for clinical trials. Those shown in Figures 16.1 and 16.2 are heavily based on those in Gardner et al. (1989), but incorporate some extensions and clarifications. They were designed to aid the statistical refereeing of papers submitted to the British Medical Journal, but are equally applicable for assessing papers already published.

图16.1中的问题可用于评估除临床试验以外的医学论文。大多数问题涉及研究设计和数据分析的重要方面。图16.2展示了用于评估临床试验报告的检查表。问题涉及上一章详细讨论过的设计和分析方面。设计中特别重要的方面与消除偏倚可能性的努力描述有关。如前所述,如果缺少相关信息,我们无法推断某种程序已被采用。例如,常见的论文将试验描述为随机且双盲,但如果仅有这三个词,而无更多信息,我们不应假设作者真正理解这些术语。常见将交替分配方法描述为随机,然而这显然不同且明显较差。Chalmers等人(1981)提出了一个更为详细的临床试验评估方案。
The questions in Figure 16.1 can be used to assess medical papers other than clinical trials. Most of the questions relate to important aspects of the design of the study and analysis of the data. Figure 16.2 shows a checklist for assessing reports of clinical trials. The questions relate to aspects of design and analysis that were discussed at length in the previous chapter. Aspects of design of particular importance relate to the description of efforts to eliminate the possibility of bias. As noted already, we cannot infer that a procedure was adopted if the relevant information is absent. For example, it is common to see papers describing trials as both randomized and double blind, but in the absence of any information beyond those three words we should not assume that the authors understand those terms. It is quite common to see methods of alternate allocation described as random; it is, of course, quite different and definitely inferior. A much more detailed assessment scheme for clinical trials was proposed by Chalmers et al. (1981).

应当理解,检查表中的许多问题没有明确的答案,因此评估论文时不可避免地带有一定主观性。尽管如此,使用这些(或其他)检查表能大大简化论文评估,部分原因是发现遗漏总比发现存在的错误更困难。
It will be appreciated that for many of the questions in the checklists there is no unequivocal answer, and assessing a paper thus involves some subjectivity. Nevertheless, the use of these (or other) checklists makes it much easier to assess a paper, partly because it is always harder to detect omissions than errors in what is present.

检查表中的问题重要性不一。理想情况下,我们希望所有问题的回答都是“是”,但很少有论文能达到这一点。实际上,我们应最关心研究设计中可能存在的偏倚。事实上,如果研究设计因某种原因不可接受,则无论数据如何分析,该论文在统计学上都是不可接受的。其次,我们希望分析方法适合数据,且结论合理。尽管报告方式不无关紧要,但显然不及方法学的基本方面重要。
The questions in the checklists are not equally important. Ideally we would like to see 'Yes' responses for all questions but few papers will achieve this. In practice we should be most concerned about possible bias in the design of the study. Indeed, if the design of the study is unacceptable for some reason, the paper is statistically unacceptable regardless of how the data were analysed. Next we would hope that the analysis was appropriate to the data, and that the conclusions were justified. Aspects of presentation, while not unimportant, are clearly less important than fundamental aspects of methodology.

研究设计 STUDY DESIGN

研究目标是否描述充分? 是 不确定 否
Is the objective of the study sufficiently described? Yes Unclear No

研究设计是否描述充分? 是 不确定 否
Is the design of the study sufficiently described? Yes Unclear No

研究设计是否适合研究目标? 是 不确定 否
Was the design of the study appropriate to the objective? Yes Unclear No

受试者来源是否描述清楚? 是 不确定 否
Is the source of the subjects clearly described? Yes Unclear No

受试者选择方法是否描述清楚(即纳入和排除标准)? 是 不确定 否
Is the method of selection of the subjects clearly described (i.e. inclusion and exclusion criteria)? Yes Unclear No

受试者样本是否适合将研究结果推广的人群? 是 不确定 否
Was the sample of subjects appropriate with regard to the population to which the findings will be referred? Yes Unclear No

样本量是否基于预先的统计功效考虑? 是 不确定 否
Was the sample size based on pre-study considerations of statistical power? Yes Unclear No

研究设计是否可接受? 是 否
Is the design of the study acceptable? Yes No

研究实施 CONDUCT OF STUDY

是否达到了令人满意的(高)响应率? 是 不确定 否
Was a satisfactory (high) response rate achieved? Yes Unclear No

分析与展示 ANALYSIS AND PRESENTATION

是否有充分描述或引用所有统计程序的说明? 是 否
Is there a statement adequately describing or referencing all the statistical procedures used? Yes No

所用统计方法是否适合数据? 是 不确定 否
Were the statistical methods used appropriate for the data? Yes Unclear No

统计方法是否正确使用? 是 不确定 否
Were they used correctly? Yes Unclear No

统计材料的展示(表格、图形、数值)是否令人满意? 是 否
Is the presentation of statistical material (tables, graphical, numerical) satisfactory? Yes No

是否提供了足够的分析? 是 不确定 否
Are sufficient analyses presented? Yes Unclear No

是否给出了主要结果的置信区间? 是 否
Are confidence intervals given for the main results? Yes No

综合评估 OVERALL ASSESSMENT

从统计分析得出的结论是否合理? 是 不确定 否
Are the conclusions drawn from the statistical analyses justified? Yes Unclear No

论文在统计学上是否可接受? 是 否
Is the paper statistically acceptable? Yes No

图16.1 一般医学论文评估检查表
Figure 16.1 Checklist for assessment of general medical papers.

研究设计 STUDY DESIGN

试验的目标是否描述充分? 是 不确定 否
Is the objective of the trial sufficiently described? Yes Unclear No

研究设计是否描述充分? 是 不确定 否
Is the design of the study sufficiently described? Yes Unclear No

是否对纳入试验的诊断标准作了令人满意的说明? 是 不确定 否
Is there a satisfactory statement of the diagnostic criteria for entry into the trial? Yes Unclear No

受试者来源是否清楚描述? 是 不确定 否
Is the source of the subjects clearly described? Yes Unclear No

治疗措施是否定义明确? 是 不确定 否
Were the treatments well defined? Yes Unclear No

治疗组是否同时进行研究? 是 不确定 否
Were the treatment groups studied concurrently? Yes Unclear No

是否采用随机分配治疗? 是 不确定 否
Was random allocation to treatment used? Yes Unclear No

是否描述了随机分配的方法(例如随机数字表)? 是 不确定 否
Is the method of creating the randomisation (e.g. tables of random numbers) described? Yes Unclear No

是否描述了治疗分配的机制(例如密封信封)? 是 不确定 否
Is the mechanism of treatment allocation (e.g. sealed envelopes) described? Yes Unclear No

治疗分配机制是否设计为消除偏倚? 是 不确定 否
Was the mechanism of treatment allocation designed to eliminate bias? Yes Unclear No

从分配到开始治疗的延迟是否在可接受的范围内? 是 不确定 否
Was there an acceptably short delay from allocation to commencement of treatment? Yes Unclear No

试验过程中是否采用了潜在的盲法程度? 是 不确定 否
Was the potential degree of blindness used during the trial? Yes Unclear No

是否对结局指标的标准做出了满意的说明? 是 不确定 否
Is there a satisfactory statement of criteria for outcome measures? Yes Unclear No

结局指标是否适当? 是 不确定 否
Are the outcome measures appropriate? Yes Unclear No

是否有基于统计功效考虑的研究前样本量计算描述? 是 否
Is there a description of a pre-study calculation of sample size based on considerations of statistical power? Yes No

是否说明了治疗后随访的持续时间? 是 不确定 否
Is the duration of post-treatment follow-up stated? Yes Unclear No

研究设计是否可接受?是 否
Is the design of the study acceptable? Yes No

研究实施 CONDUCT OF STUDY

是否有较高比例的受试者被随访? 是 不确定 否
Were a high proportion of subjects followed up? Yes Unclear No

是否有较高比例的受试者完成治疗? 是 不确定 否
Did a high proportion of subjects complete treatment? Yes Unclear No

是否分别描述了各治疗组的脱落情况? 是 不确定 否
Are drop-outs described separately for each treatment group? Yes Unclear No

是否分别描述了各组的治疗副作用? 是 不确定 否
Are side-effects of treatment described separately for each group? Yes Unclear No

分析与呈现 ANALYSIS AND PRESENTATION

是否有充分描述或引用所有统计程序的说明? 是 否
Is there a statement adequately describing or referencing all the statistical procedures used? Yes No

各组的基线特征是否充分展示? 是 否
Are the baseline characteristics of each group presented adequately? Yes No

使用的统计方法是否适合数据? 是 不确定 否
Were the statistical methods used appropriate for the data? Yes Unclear No

统计方法使用是否正确? 是 不确定 否
Were they used correctly? Yes Unclear No

预后因素是否得到充分考虑? 是 不确定 否
Have prognostic factors been adequately considered? Yes Unclear No

统计资料(表格、图形、数值)的呈现是否令人满意? 是 否
Is the presentation of statistical material (tables, graphical, numerical) satisfactory? Yes No

是否提供了足够的分析? 是 不确定 否
Are sufficient analyses presented? Yes Unclear No

主要结果是否给出了置信区间? 是 否
Are confidence intervals given for the main results? Yes No

综合评价 OVERALL ASSESSMENT

从统计分析中得出的结论是否合理? 是 不确定 否
Are the conclusions drawn from the statistical analyses justified? Yes Unclear No

论文的统计方法是否可接受?是 否
Is the paper statistically acceptable? Yes No

图16.2 临床试验报告评估清单。
Figure 16.2 Checklist for assessment of reports of clinical trials.

16.5 撰写科学论文 16.5 WRITING A SCIENTIFIC PAPER

医学期刊对作者在论文中统计部分的指导很少。显然,阅读论文和撰写论文在某些方面有许多相似之处,图16.1和16.2中的清单应能很好地展示论文中应包含的信息类型。Altman等人(1989)提供了更为全面的作者指南,涵盖统计学的各个方面。关于研究统计部分应遵循的基本原则是,方法应描述得足够详细,以便完全理解,并使任何有原始数据的人都能在需要时复现你的结果。简单来说,你应清晰描述具体做了什么。
Medical journals give authors scant guidance regarding the statistical aspects of papers. Clearly there is much similarity between aspects of reading a paper and those of writing a paper, and the checklists in Figures 16.1 and 16.2 should give a good idea of the type of information that should be included in a paper. A much more comprehensive set of guidelines for authors is given by Altman et al. (1989), covering all aspects of statistics. The basic principle to be adhered to with respect to the statistical aspects of your research is that the methods should be described in sufficient detail to be fully understood, and so that anyone else with access to your raw data could, if desired, reproduce your results. Put more simply, you should describe clearly exactly what was done.

如果你的研究目标明确,设计合理且样本量充足,进行了适当的分析并从结果中得出了合理的推论,你应该能够在优秀期刊发表论文。最重要的方面无疑是设计,因为数据总是可以重新分析。虽然希望读完本书后你能进行合理的数据分析,但你可以从以下评论中获得安慰:
If your research had a useful objective, you have used a sensible design and adequate sample size, you have performed appropriate analyses and drawn reasonable inferences from your findings you should be able to get your paper published in a good journal. The most important aspect is unquestionably the design, because it is always possible to reanalyse your data. While I hope that after reading this book you will be able to carry out sensible analyses of your data, you may take comfort from the following comment:

在担任一个主要同行评审期刊编辑的六年里,我不记得曾因统计计算错误而单独拒绝过论文,但大量的拒稿是因为统计设计和概念上的更根本问题,这些问题事后很难弥补。
I cannot recall that during six years as editor of a major peer-reviewed journal I ever turned down a paper solely because statistical computations were in error, but a large proportion of disapprovals resulted from more fundamental problems of statistical design and concept, which can rarely be remedied after the fact.

(Bailar, 1986)
(Bailar, 1986)

统计学是一个相对较新的领域,和医学一样,也受到时尚潮流的影响。近年来,许多领先的医学期刊在政策上发生了重大转变,开始鼓励甚至要求作者在呈现主要结果时使用置信区间。这些方法已经存在了几十年,但直到最近才被医学界接受。或许这一过程的延续将是对假设检验及其伴随的大量 P 值和星号的使用减少,这些在大多数医学研究论文中随处可见(Evans 等,1988)。我不会预测这种发展,但未来几年医学期刊中统计学的使用很可能会继续发生变化。我的希望是,我们能广泛减少重要统计错误的发生,这些错误的存在损害了文献的完整性,最终可能对患者产生不利后果。继监管机构对新药评估研究的统计质量产生显著影响之后,越来越多的迹象表明人们对良好设计和正确分析的重要性有了更广泛的认识。
Statistics is a relatively new field, and like medicine is subject to fashion. In recent years there has been a major shift in policy by many leading medical journals towards encouraging or even requiring authors to use confidence intervals when presenting their main results. The methods have been around for many decades, but have only belatedly become accepted in medicine. Perhaps a continuation of this process would be a decline in the use of hypothesis tests and the plethora of P values and asterisks that adorn most medical research papers (Evans et al., 1988). I shall not predict such a development, but it seems likely that the next few years will see further changes in the use of statistics in medical journals. My hope is that we will begin to see a widespread reduction in the frequency of important statistical errors, whose presence compromises the integrity of the literature and may, ultimately, lead to adverse consequences for patients. Following the marked effect on the statistical quality of studies evaluating new drugs as a result of requirements of the regulatory authorities, there are signs of a more widespread increase in awareness of the importance of good design and correct analysis.

为此,我希望本书能够成功传达医学研究统计部分的关键概念。例如,记住比例置信区间的公式(或任何公式)并不重要,你总可以查阅。重要的是理解置信区间的含义。更广泛地说,理解研究设计和统计推断背后的关键统计概念是必不可少的。
To this end I hope that in this book I have succeeded in getting across the important concepts that underlie the statistical component of medical research. For example, it is not important to remember the formula for the confidence interval for a proportion (or any formula for that matter); you can always look it up. It is important to understand what the confidence interval means. More generally it is essential to understand the key statistical concepts underlying study design and statistical inference.

我将以 Mainland(1950年)的一些评论作结,他是一位解剖学家,后来也成为了统计学教授。这些评论至今仍然适用:
I shall close with some comments of Mainland (1950), an anatomist who subsequently became a professor of statistics too. They are still as relevant as when he wrote them:

最后,必须再次强调,无论找到什么样的帮助资源,采用何种技术,研究者本人都必须掌握统计推理的原则……现代统计原则不是我们可以随意取舍的,它们构成了所有领域研究者的逻辑基础,包括临床研究领域。
Finally, it must be stressed again that, whatever sources of help are found and whatever techniques are employed, the investigator himself has to grasp the principles of statistical reasoning … modern statistical principles are not something that we can take or leave as we wish, for they comprise the logic of the investigator in all fields, including the field of clinical research.

练习题 EXERCISES

(这些问题并非本章内容特有。)
(These problems are not specific to the material in this chapter.)

16.1 如果两项研究的结果分别得到 P = … 和 P = …,为何如16.3.6节所述,前者不一定比后者发现了更强的效应?
16.1 If two studies' results yield P = … and P = …, why is it not necessarily true, as noted in section 16.3.6, that the former has found a stronger effect than the latter?

16.2 如果两项相同的研究结果分别得到 P = … 和 P = …,造成如此大差异的可能原因有哪些?
16.2 If two identical studies yield P = … and P = …, what are the possible explanations for the large difference?

16.3 某人群中成年男性和女性的身高(单位:厘米)均值和标准差如下:
16.3 The mean and standard deviation of the heights (in cm) of adult men and women in a population are as follows

          均值     标准差
男性      179.1    5.84
女性      171.7    5.75

          Mean     SD
Men       179.1    5.84
Women     171.7    5.75

假设两性身高均服从正态分布,
Assuming that for both sexes height has a Normal distribution,

(a) 有多少比例的女性身高超过男性的平均身高?
(a) What proportion of women are above the average height of men?

(b) 若成年人中女性占60%,身高超过182.9厘米(六英尺)的成年人中女性占多少比例?
(b) If 60% of adults are female, what proportion of adults taller than 182.9 cm (six feet) are women?

16.4 下表显示了1931-1935年与1983年英格兰和威尔士女性的按年龄分组及总年度死亡率(每千人)(英国人口普查与调查局,死亡率统计)。
16.4 The following table shows age-specific and total annual death rates per 1000 females in England and Wales in 1931-5 and 1983 (Office of Population Censuses and Surveys Mortality Statistics).

年龄          每千人年死亡率
              1931–5年    1983年
< 1岁         54          9
1–4岁         6.2         0.4
5–9岁         2.1         0.2
10–14岁       1.4         0.2
15–19岁       2.2         0.3
20–24岁       2.8         0.3
25–34岁       3.1         0.5
35–44岁       4.3         1.2
45–54岁       8.0         3.6
55–64岁       17          10
65–74岁       43          24
75–84岁       109         64
85岁以上      245         176
所有年龄      11.4        11.4

Age           Annual death rates/1000
              1931–5      1983
< 1           54          9
1–4           6.2         0.4
5–9           2.1         0.2
10–14         1.4         0.2
15–19         2.2         0.3
20–24         2.8         0.3
25–34         3.1         0.5
35–44         4.3         1.2
45–54         8.0         3.6
55–64         17          10
65–74         43          24
75–84         109         64
85+           245         176
All ages      11.4        11.4

在这五十年间,每个年龄组的死亡率都有大幅下降。那为什么所有年龄段女性的总体每千人死亡率却保持不变呢?
Over the fifty years there was a large decline in the death rate in every age group. What would explain the fact that the overall death rate per 1000 women of all ages was unchanged?

【16】5 下表显示了按社会阶层划分的五岁儿童中,患有五颗或以上龋齿、缺失或补牙的百分比(% dmft),分别针对一直居住在含氟水区或邻近非含氟水区的儿童(Carmichael 等,1989)。
16.5 The following table shows the percentage of five year olds with five or more decayed, missing or filled teeth by social class, separately for children who had lived continuously in either an area with fluoridated water or in a nearby non- fluoridated area (Carmichael et al., 1989).

| 社会阶层 Social class | % dmft:含氟区 Fluoridated | % dmft:非含氟区 Non-fluoridated |
|---|---|---|
| I-II | 10 | 21 |
| III | 15 | 33 |
| IV-V | 21 | 45 |
| 未分类 Unclassified | 20 | 47 |

(a) 使用卡方检验比较两个地区按社会阶层划分的龋齿指数(dmft百分比)是否合理?
(a) Is it reasonable to use a Chi squared test to compare dmft by social class in the two areas?

(b) 儿童的实际人数如下:
(b) The actual numbers of children were as follows:

| 社会阶层 Social class | dmft:含氟区 Fluoridated | dmft:非含氟区 Non-fluoridated |
|---|---|---|
| I-II | 12/117 | 12/56 |
| III | 26/170 | 48/146 |
| IV-V | 11/52 | 29/64 |
| 未分类 Unclassified | 24/118 | 49/104 |
| 合计 Total | 73/457 | 138/370 |

计算两个地区儿童总群体中 dmft 差异的 95% 置信区间。
Calculate a 95% confidence interval for the difference in dmft in the total groups of children in the two areas.

(c) 在两个地区中, dmft 与社会阶层之间是否存在显著关系?
(c) Is there a significant relation between dmft and social class within each of the two areas?

(d) 我们如何评估该关系在无氟区是否比氟化区更强?
(d) How might we assess whether the relation is stronger in the non-fluoridated than the fluoridated area?

(e) 这种效应叫什么名称?
(e) What is the name for such an effect?

16.6 50 名孕妇的收缩压(SBP)在同一只手臂上同时用动脉内测量(直接法)和血压计(间接法)测量(Raftery 和 Ward,1968)。测量了手臂围度和体重,因为有人建议这些因素可能影响两种测量方法之间的差异。数据见下表:
16.6 The systolic blood pressure (SBP) of 50 pregnant women was measured simultaneously in the same arm using intra-arterial measurement (direct) and a sphygmomanometer (indirect) (Raftery and Ward, 1968). Arm circumference and weight were measured because there was some suggestion that they might affect the difference between the two measurements. The data are shown in the following table:

| | 年龄 Age | 体重 Weight (kg) | 手臂围度 Arm circumference (cm) | 间接法收缩压 Systolic BP indirect (mm Hg) | 差值 Diff (I-D) |
|---|---|---|---|---|---|
| 1 | 32 | 78.2 | 29 | 115 | |
| 2 | 25 | 67.8 | 25 | 122 | |
| 3 | 35 | 71.7 | 26 | 118 | |
| 4 | 41 | 60.8 | 24 | 127 | |
| 5 | 30 | 78.7 | 33 | 110 | |
| 6 | 29 | 87.8 | 31 | 146 | |
| 7 | 20 | 68.9 | 26 | 127 | |
| 8 | 38 | 70.5 | 25 | 126 | |
| 9 | 31 | 68.0 | 29 | 81 | |
| 10 | 39 | 72.6 | 28 | 127 | |
| 11 | 36 | 53.3 | 31 | 127 | |
| 12 | 23 | 53.3 | 23 | 80 | |
| 13 | 25 | 46.0 | 22 | 89 | |
| 14 | 35 | 65.9 | 26 | 136 | |
| 15 | 26 | 68.0 | 26 | 105 | |
| 16 | 23 | 73.0 | 29 | 99 | |
| 17 | 19 | 65.6 | 26 | 129 | |
| 18 | 21 | 59.9 | 25 | 98 | |
| 19 | 31 | 77.8 | 29 | 115 | |
| 20 | 30 | 82.0 | 31 | 169 | |
| 21 | 39 | 65.8 | 26 | 107 | |
| 22 | 30 | 63.5 | 25 | 166 | |
| 23 | 35 | 63.6 | 28 | 93 | |
| 24 | 34 | 73.6 | 27 | 115 | |
| 25 | 30 | 62.1 | 25 | 93 | |
| 26 | 26 | 81.1 | 33 | 118 | |
| 27 | 29 | 70.5 | 30 | 116 | |
| 28 | 27 | 65.8 | 26 | 111 | |
| 29 | 19 | 77.6 | 31 | 159 | |
| 30 | 21 | 58.1 | 24 | 110 | |
| 31 | 44 | 76.2 | 24 | 93 | |
| 32 | 20 | 58.4 | 25 | 93 | |
| 33 | 33 | 59.2 | 28 | 117 | |
| 34 | 41 | 59.8 | 27 | 120 | |
| 35 | 28 | 54.9 | 23 | 114 | |
| 36 | 28 | 79.4 | 28 | 132 | |
| 37 | 18 | 64.9 | 26 | 157 | |
| 38 | 18 | 67.6 | 25 | 109 | |
| 39 | 32 | 61.0 | 27 | 157 | |
| 40 | 20 | 87.0 | 43 | 126 | |
| 41 | 21 | 51.5 | 23 | 83 | |
| 42 | 31 | 81.6 | 29 | 116 | |
| 43 | 29 | 72.1 | 31 | 158 | |
| 44 | 21 | 95.3 | 31 | 118 | |
| 45 | 28 | 74.2 | 28 | 123 | |
| 46 | 22 | 79.6 | 27 | 154 | |
| 47 | 20 | 70.5 | 26 | 126 | |
| 48 | 28 | 79.5 | 26 | 119 | |
| 49 | 19 | 60.9 | 24 | 73 | |
| 50 | 20 | 77.6 | 27 | 116 | |

(a) 进行适当的分析以量化直接测量和间接测量的收缩压之间的一致性。
(a) Carry out an appropriate analysis to quantify the agreement between the directly and indirectly measured systolic blood pressures.

(b) 利用 (a) 的结果,估计有多少比例的女性,其两种方法的差异预计在 以内?
(b) Using the results from (a), for what proportion of women would the difference between the methods be expected to be within

(c) 差异与体重、臂围或年龄之间是否存在关系?
(c) Is there any relation between the differences and weight, arm circumference or age?

(d) 上表显示了按女性受试者研究顺序排列的数据。下表显示了按每10人一组计算的间接法与直接法收缩压差异的均值和标准差:
(d) The above table shows the data in the order in which the women were studied. The following table shows the mean and standard deviation of the indirect-direct differences in systolic blood pressure for the women taken in blocks of 10:

| 女性 Women | 收缩压差异均值 Mean difference in SBP (mm Hg) | 标准差 SD (mm Hg) |
|---|---|---|
| 1–10 | -12.3 | 14.0 |
| 11–20 | -4.7 | 11.1 |
| 21–30 | -5.2 | 9.1 |
| 31–40 | -1.8 | 14.3 |
| 41–50 | 0.0 | 11.8 |
| 总计 Total | -4.8 | 12.5 |

收缩压均值差异的变化可能由什么原因解释?
What might explain the variation in the mean difference in SBP?

(e) 重复 (a) 和 (b),排除前十名女性,并将结果与全部50名女性的结果进行对比。
(e) Repeat (a) and (b) excluding the first ten women and contrast the answers with those obtained for all 50 women.

16.7 作为急性高山病研究的一部分,东非攀登队的15名成员在快速升至 前后测量了血浆醛固酮水平(Milledge 等,1989)。下表显示了三天内低海拔和高海拔09:00的测量值,以及急性高山病(AMS)症状评分,数值越高表示症状越严重。
16.7 As part of a study of acute mountain sickness fifteen members of a climbing expedition to East Africa had plasma aldosterone measurements taken before and after a rapid ascent to (Milledge et al., 1989). The following table shows measurements taken at 09.00 at low and high altitudes over three days, together with a symptom score for acute mountain sickness (AMS) - high values mean worse symptoms.

受试者AMS评分低 第1天血浆醛固酮(mmol/l)
高 第2天高 第3天
1168151188
211534977
3711014195
49238286143
51120424284
61114126363
711183233121
SubjectAMS scoreLow Day 1Plasma aldosterone (mmol/l)
High Day 2High Day 3
1168151188
211534977
3711014195
49238286143
51120424284
61114126363
711183233121
受试者AMS评分血浆醛固酮(mmol/l)
低 第1天高 第2天高 第3天
81311924597
913272275115
1014166241150
111522810976
12177719263
13181144643
14239118974
1535105254283
SubjectAMS scorePlasma aldosterone (mmol/l)
Low Day 1High Day 2High Day 3
81311924597
913272275115
1014166241150
111522810976
12177719263
13181144643
14239118974
1535105254283

(a) 对三天的血浆醛固酮水平进行适当的分析。哪几天之间的差异具有统计学显著性?
(a) Carry out an appropriate analysis of the plasma aldosterone levels on the three days. Which pairs of days are significantly different?

(b) 获取第1天和第2天血浆醛固酮水平差异的95%置信区间。
(b) Obtain a 95% confidence interval for the difference between plasma aldosterone levels on days 1 and 2.

(c) 检查AMS评分与第1天和第2天血浆醛固酮变化之间的关联。
(c) Examine the association between the AMS score and the change in plasma aldosterone between days 1 and 2.

16.8 针对对一项针对肠易激综合征患者饮食中麸皮饼干与安慰剂饼干的交叉临床试验分析的批评,第一作者回应道:
16.8 In response to the criticism of the analysis of a crossover clinical trial of bran biscuits versus placebo biscuits in the diet of patients with irritable bowel syndrome, the first author wrote:

“然而,在像我们这样的研究中,鉴于样本量相对较小,置信区间……很可能包含零,区间较宽,并且包括正负值。因此,这种分析方法在此情境下并不合适,因为结果总是过于分散,难以具有实际意义。”
'In studies such as ours, however, given the relatively small numbers, a confidence interval … is likely to contain zero, be fairly wide and include both positive and negative values. Therefore this is not an appropriate setting for this form of analysis as the result always will be too diffuse to be meaningful'

(Lucey, 1987).
(Lucey, 1987).

这是一个合理的论点吗?
Is this a reasonable argument?

附录 A 数学符号 Appendix A Mathematical notation

A1.1 引言 A1.1 INTRODUCTION

本书中反复使用许多数学表达式,这些将在本附录中进行解释。数学符号可能令人困惑,因为相同的字母在不同情境下代表不同的量,且相同的符号可能有不同的用法。此外,同一表达式可能有多种表示方法。另一个问题是,虽然通常存在标准符号,但很多情况下并无统一标准。因此,在查阅两本或更多教材时,可能会因表达方式不同而感到困惑。为此,下面的一些条目会提及在其他地方可能遇到的常见替代符号,尽管它们未在本书中出现。
In this book repeated use is made of many mathematical expressions. These are explained in this appendix. Mathematical notation can be confusing, with the same letters used to denote different quantities in different situations, and with the same symbols used in different ways. Also, there may be several ways of depicting the same expression. A further problem is that while there is often a standard notation, in many cases there is not. Thus it can be confusing to look up the same item in two or more textbooks because they use different ways of expressing the same formula. To help a little, some entries below refer to common alternative forms of notation that may be encountered elsewhere, although they do not appear in this book.

接下来的三个部分讨论基本概念及符号和函数的使用,之后是符号词汇表。
The next three sections discuss basic ideas and the use of symbols and functions, after which there is a glossary of notation.

A1.2 基本概念 A1.2 BASIC IDEAS

A1.2.1 变量 A1.2.1 Variables

当我们使用数学公式时,需要一种简便的方式来表示变量的值。例如,如果我们想用公式表达通过用死亡年份减去出生年份来计算一个人的死亡年龄的想法,我们用字母代替每个变量。传统上,我们常用 \(X\)、\(Y\) 和 \(Z\) 来表示变量,因此可以将上述计算写为
When we use a mathematical formula we need a simple way to refer to the values of a variable. For example, if we wish to express as a formula the idea that we calculate a person's age at death by subtracting their year of birth from the year in which they died, we replace each variable by a letter. Traditionally we often use \(X\), \(Y\) and \(Z\) to indicate variables, so we could write the above calculation as

\[ X = Y - Z \]

其中 \(X\) 表示死亡年龄,\(Y\) 表示死亡年份,\(Z\) 表示出生年份。通常(但非绝对)用大写字母表示变量,小写字母表示该变量的具体值。
where \(X\) represents age at death, \(Y\) represents year of death, and \(Z\) represents year of birth. It is common, but not universal, to use a capital letter to indicate a variable, and a small letter to indicate a value of that variable.

为了表示变量的某个特定值,通常使用下标。因此,为了表示样本中第四个人的变量 \(X\) 的值,我们写作 \(x_4\)。在前面的例子中,\(x_4\) 表示第四个人的死亡年龄。
To denote a particular value of a variable we usually use a subscript. Thus to indicate the value of the variable \(X\) for the fourth person in a sample, we write \(x_4\). In the previous example \(x_4\) represents the age at death of the fourth person.

我们常常希望表示一个未指定个体的数值,这时用 \(x_i\) 表示样本中第 i 个个体的变量 \(X\) 的值。字母 \(i\) 和 \(j\) 经常以这种方式使用。
Often we wish to denote the value for an unspecified individual, in which case we use \(x_i\) to indicate the value of the variable \(X\) for the 'ith' subject in the sample. The letters \(i\) and \(j\) are often used in this way.

不幸的是,另一种下标的用法可能导致混淆。当我们有多个变量时,使用下标表示变量编号是很方便的,比如 \(X_1\)、\(X_2\)、\(X_3\) 等。下标的具体含义应从上下文中明确。
Unfortunately, a different use of subscripts can cause confusion. When we have many variables it is convenient to use subscripts to indicate the number of the variable, such as \(X_1\), \(X_2\), \(X_3\), and so on. The exact meaning of the subscript ought to be clear from the context.

A1.2.2 统计量 A1.2.2 Statistics

从原始数据得出的汇总值称为统计量,例如均值、标准差和比例。我们也用字母在公式中表示这些值。变量 \(x\) 的均值记作 \(\bar{x}\)(读作“x bar”),标准差记作 \(s\),比例通常记作 \(p\)。当在同一公式中需要表示多个同类统计量时,我们用不同的下标。例如,\(p_1\) 和 \(p_2\) 分别表示两个样本中观察到的比例。同样,下标的含义应从上下文中明确。
Summary values derived from the raw data are called statistics; examples are means, standard deviations and proportions. We also use letters to denote these values in formulae. The mean of a variable called \(x\) is denoted \(\bar{x}\) (and pronounced 'x bar'), the standard deviation is denoted \(s\), and a proportion is usually denoted as \(p\). When we need to refer to more than one statistic of the same type in the same formula we use different subscripts. For example, we might use \(p_1\) and \(p_2\) to denote observed proportions in two samples. Again the meaning of the subscript should always be clear from the context.

A1.2.3 乘法 A1.2.3 Multiplication

乘法在本书中大量公式中出现。表示乘法的方法有几种。除了常见的乘号 \(\times\),有时用句点表示乘法,而在计算机编程中(但非一般用法)用星号 * 表示。最令人困惑的是,有时根本不使用符号,直接将公式中两个相邻的量相乘。这是因为乘号与公式中大量使用的字母 \(x\) 非常相似,而句点可能被误认为是小数点。
Multiplication features in a high proportion of the formulae used in this book. There are several alternative methods of indicating that quantities are multiplied together. Apart from the usual multiplication sign, \(\times\), we sometimes use a full stop, while in computer programming (but not in general use) we use an asterisk, *. Most confusingly, sometimes we use no symbol at all, relying on the idea that we multiply two adjacent separate quantities in a formula. This is because the multiplication sign looks very similar to the letter \(x\), which is used a great deal in formulae, and a full stop could be confused with the decimal point.

例如,我们有
Thus, for example, we have

最后一种用法,即不使用符号表示乘法,是最常见的方法。因此,当我们将两个量 相乘时,写作 。这也是为什么用单个字母表示变量。
The last usage, without a symbol to indicate the multiplication, is the most common method. Thus when we multiply two quantities such and we write the product as . This is why we use a single letter to denote a variable.

A1.2.4 括号 A1.2.4 Brackets

括号用于将表达式分组,通常涉及加法或减法,其中整个表达式是更复杂公式的一部分。前一节中给出了一个简单的例子;一个更复杂的例子是
Brackets are used for grouping expressions, usually involving addition or subtraction, where the whole expression is part of a more complicated formula. A simple example was given in the preceding section; a more complicated example is

这是第10章中计算的一个量,其中四个由两个频数之和组成的项相乘。
a quantity calculated in Chapter 10 in which four sums of two frequencies are multiplied.

括号内的量应先于计算的其他部分进行计算。因此,如果我们希望计算
Quantities within brackets should be calculated before other parts of the calculation. Thus if we wish to evaluate

其中 ,则有
where and , we have

对于复杂的公式,我们常常需要在一个括号内再嵌套一组括号。为了便于阅读,我们使用不同类型的括号,通常是圆括号套方括号,再套花括号。一个例子是
For complicated formulae we often need to have one set of brackets within another. To make these easier to read we use different types of brackets, and usually have round brackets within square brackets within curly brackets. An example is

A1.2.5 除法 A1.2.5 Division

公式中表示除法有两种方式。例如,要表示 除以 ,我们可以写成 。第一种方法中的括号是必需的,用以区分 。除号上方的量称为分子,下方的量称为分母。
There are two ways of denoting division in formulae. To show, for example, the quantity divided by we can write either or . The brackets in the first method are essential to distinguish from . The upper quantity in a division is the numerator and the lower quantity is the denominator.

如果分母包含多个元素,则可能需要括号。例如,要表示 除以 ,我们使用 。数学符号中通常不使用符号 \(\div\)。
If the denominator involves multiple elements then brackets may be needed. For example, to denote divided by we use either or . The symbol \(\div\) is not usually used in mathematical notation.

A1.2.6 幂和平方根 A1.2.6 Powers and square roots

当我们将一个数乘以自身时,得到该数的平方;如果再乘以原数,则得到立方。例如,一个边长为4.2米的正方形房间,其面积是 \(4.2 \times 4.2 = 17.64\) 平方米。如果房间高度也是4.2米,则体积是 \(4.2 \times 4.2 \times 4.2 = 74.088\) 立方米。
When we multiply a quantity by itself we get the square of the original value, and if we multiply the result by the original value again we get its cube. Thus if we have a room that is 4.2 metres square, its area would be \(4.2 \times 4.2 = 17.64\) square metres. If it is also 4.2 metres high, its volume is \(4.2 \times 4.2 \times 4.2 = 74.088\) cubic metres.

我们用上标2表示平方,用上标3表示立方。因此,房间的地面积是 \(4.2^2\) 平方米,体积是 \(4.2^3\) 立方米。上标表示需要将数值自乘的次数,称为幂。更一般地,我们写成 \(x^n\) 表示 \(x\) 的 \(n\) 次幂。有时需要在 \(n = 0\) 时计算 \(x^n\)。对任意 \(x\),\(x^0\) 的值为1。
We denote the square of a number by a superscript of 2, and a cube with a superscript of 3. The floor area of the room is thus \(4.2^2\) square metres and its volume is \(4.2^3\) cubic metres. The superscript indicates the number of times we must multiply the value by itself, and is known as the power. More generally we write \(x^n\) to indicate the value of \(x\) to the power \(n\). Sometimes we need to evaluate \(x^n\) when \(n = 0\). The value of \(x^0\) is 1, for any value of \(x\).

平方根是逆过程。一个数的平方根是指平方后得到该数的那个数。例如,以上例子中,4.2是17.64的平方根。我们写作 \(\sqrt{17.64} = 4.2\)。另一种有时见到的表示法(但本书不使用)是 \(17.64^{1/2}\)。类似地,\(1/x\) 称为 \(x\) 的倒数,也可写成 \(x^{-1}\)。
The square root involves the reverse process. The square root of a number is the number that when squared gives the first number. For example, using the above example, 4.2 is the square root of 17.64. We write this as \(\sqrt{17.64} = 4.2\). Alternative notation sometimes seen (but not used in this book) is \(17.64^{1/2}\). Similarly, the quantity \(1/x\), which is known as the reciprocal of \(x\), may be written as \(x^{-1}\).
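
The room example can be checked numerically. The following is a minimal Python sketch, not part of the original text; the variable names are illustrative, and floating-point results are rounded before printing because they are only approximately exact.

```python
import math

side = 4.2                  # side length of the room in metres (from the text)
area = side ** 2            # square: 4.2 x 4.2
volume = side ** 3          # cube: 4.2 x 4.2 x 4.2

# The square root reverses squaring; a power of 1/2 is the same operation.
root = math.sqrt(area)
root_alt = area ** 0.5

# A power of 0 gives 1, and a power of -1 gives the reciprocal.
zero_power = side ** 0
reciprocal = side ** -1

print(round(area, 2), round(volume, 3), round(root, 2))
```

The `**` operator is Python's notation for a power; it plays the role of the superscript in the written formulae.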

一个结合了迄今为止讨论的各种特征的例子是
An example that combines the various features discussed so far is

A1.2.7 求和 A1.2.7 Summation

统计公式的一个常见特征是需要表示若干项的总和。例如,一组观测值的均值是所有观测值之和除以观测值的数量。如果我们有 \(n\) 个观测值,分别记为 \(x_1, x_2, \ldots, x_n\),那么如第3章所述,我们可以计算均值 \(\bar{x}\),公式为
A common feature of statistical formulae is the need to indicate the sum of a number of items. For example, the mean of a set of observations is calculated from the sum of all the observations divided by the number of observations. If we have \(n\) observations denoted by \(x_1, x_2, \ldots, x_n\) then, as described in Chapter 3, we can calculate the mean, \(\bar{x}\), as

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

但这写法冗长。我们使用“求和符号” \(\Sigma\)(希腊字母大写西格玛)表示“求和”,可以简写为
but this is long-winded. We use the 'summation sign' \(\Sigma\) (the Greek capital sigma) to indicate 'sum of', and can abbreviate the expression to

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

求和符号上下的符号表示被加数的取值范围。实际上,这些取值通常很明显,因此我们常用简写 \(\sum x_i\) 或 \(\sum x\)。与前面讨论的例子类似,我们用括号来明确求和的内容。因此
where the symbols below and above the sigma indicate the range of values being added. In practice, it is usually obvious what these values are, so we use the shorthand \(\sum x_i\) or \(\sum x\). As with other examples already discussed, we use brackets to clarify what is being summed. Thus

表示对每个 计算 ,将结果平方后对所有 求和。
indicates that we calculate for each value of , square them, and add them up for all values of .

注意 ,这与 不同。
Note that and is not the same as .
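
The role of brackets in summation can be illustrated with a short Python sketch. The data are hypothetical, and the sum of squared deviations from the mean is chosen here as an assumed example of a bracketed sum. The sketch also shows that the sum of the squares differs from the square of the sum.

```python
# Hypothetical observations, used only for illustration.
observations = [3.0, 5.0, 7.0, 9.0]
n = len(observations)

# Mean: the sum of all observations divided by their number.
mean = sum(observations) / n

# Bracketed sum: subtract the mean from each value, square, then add up.
sum_sq_dev = sum((x - mean) ** 2 for x in observations)

# Sum of squares: square each value first, then add.
sum_of_squares = sum(x ** 2 for x in observations)

# Square of the sum: add first, then square the total.
square_of_sum = sum(observations) ** 2

print(mean, sum_sq_dev, sum_of_squares, square_of_sum)
```

For these values the sum of squares is 164 while the square of the sum is 576, so the order of squaring and summing matters.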

有时我们使用两个 符号来表示双重求和。例如,表达式
Sometimes we use two signs to indicate double summation. For example, the expression

表示右侧的表达式对所有 从 1 到 以及 从 1 到 的组合进行求和。注意对应的双下标的使用。该公式出现在第 10.6.6 节。
means that the expression on the right is added for every combination of values of from 1 to and from 1 to . Note the corresponding use of double subscripts. This formula appears in section 10.6.6.

A1.2.8 乘积 A1.2.8 Products

有时我们需要表示多个项的乘积;也就是说,我们需要将它们全部相乘。如果我们有 \(n\) 个观测值,记为 \(x_1, x_2, \ldots, x_n\),那么我们可以计算它们的连乘积为
We sometimes need to indicate the product of a number of items; that is, we need to multiply them all together. If we have \(n\) observations denoted as \(x_1, x_2, \ldots, x_n\), then we can calculate their multiple product as

\[ x_1 \times x_2 \times \cdots \times x_n \]

但这样写很冗长。我们使用 \(\Pi\)(希腊大写字母 pi)表示“乘积”,可以将表达式简写为
but this is long-winded. We use \(\Pi\) (the Greek capital pi) to indicate 'product of', and can abbreviate the expression to

\[ \prod_{i=1}^{n} x_i \]

其中 pi 字母上下的符号表示乘积的取值范围。实际上,与求和类似,这些取值通常是显而易见的,因此我们使用简写 \(\prod x_i\) 或 \(\prod x\)。
where the symbols below and above the letter pi indicate the range of values being multiplied. In practice, as with summation, it is usually obvious what these values are, so we use the shorthand \(\prod x_i\) or \(\prod x\).

A1.2.9 阶乘 A1.2.9 Factorials

另一种乘积是阶乘。例如,我们写作 5!(读作“五的阶乘”),表示 \(5 \times 4 \times 3 \times 2 \times 1\)。一般地,\(n!\) 表示从 1 到 \(n\) 所有整数的乘积。我们定义 \(0! = 1\)。本书中阶乘用于费舍尔精确检验,见第10.7.3节。
Another sort of product is the factorial. We write, for example, 5! (pronounced 'five factorial') to mean \(5 \times 4 \times 3 \times 2 \times 1\). In general \(n!\) means the product of all the integers from 1 up to \(n\). We define \(0! = 1\). Factorials are used in this book for Fisher's exact test, in section 10.7.3.
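
Products and factorials can be computed directly with Python's standard `math` module. This is a small sketch with hypothetical values, included only to make the notation concrete.

```python
import math

# Hypothetical observations, used only for illustration.
values = [2, 3, 4]

# Multiple product: multiply all the items together.
product = math.prod(values)

# Factorial: the product of all integers from 1 up to n.
five_factorial = math.factorial(5)
same_by_product = math.prod(range(1, 6))   # 1 x 2 x 3 x 4 x 5

# By definition, 0! is 1.
zero_factorial = math.factorial(0)

print(product, five_factorial, zero_factorial)
```

`math.prod` requires Python 3.8 or later; on older versions a loop that multiplies the items one at a time gives the same result.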

A1.3 数学符号 A1.3 MATHEMATICAL SYMBOLS

\(|\ |\) 表示绝对值,即忽略符号的数值大小。例如,
\(|\ |\) indicates the absolute value of the quantity between the vertical lines; that is, the sign is ignored. For example,

\(\pm\) 表示“加或减”。例如,表达式 两个数的简写。
\(\pm\) indicates 'plus or minus'. For example, the expression is shorthand for the two quantities and .

上划线(如 \(\bar{x}\))表示字母所代表变量的均值,这里是 \(x\) 的均值。
Bar (e.g. \(\bar{x}\)) indicates the mean of the variable denoted by the letter, here \(x\).

帽子符号(如 )表示对字母所代表量的估计值,这里是 的估计值。
Hat (e.g. ) indicates an estimate of the quantity denoted by the letter, here .

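\(<\) 和 \(>\) 用于表示不等式:
\(<\) and \(>\) are used to indicate inequalities:

\(<\) 小于
\(<\) is less than

\(>\) 大于
\(>\) is greater than

\(\leq\) 小于或等于
\(\leq\) is less than or equal to

\(\geq\) 大于或等于
\(\geq\) is greater than or equal to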

A1.4 函数 A1.4 FUNCTIONS

另一种常见的统计符号是数学函数。此符号表示一种通用关系。例如,如果我们定义一个函数 使得 ,那么我们可以写成 ,表示 。这里 简单表示对 的一个指定函数或变换。本书中最常用的函数是 ,表示对数变换。使用这种符号时,函数名(这里是“log”)描述了对括号内数值所做的操作。此处括号的用法与前面不同;特别地,我们不将 解释为 f 乘以 。更复杂的是,有时我们省略括号。因此, 常写作 log
Another common type of statistical notation is the mathematical function. This notation indicates a general relationship. For example, if we define a function so that , then we can write to mean . Here simply means a specified function or transformation of . The most common function used in this book is indicating the logarithmic transformation. With this type of notation, it is understood that the name of the function, here 'log', describes what is done to the value in brackets. This use of brackets is thus different from that given above; in particular we do not interpret as f multiplied by . To confuse matters further, in some cases we omit the brackets. Thus is often written simply as log .

A1.4.1 对数 A1.4.1 Logarithms

对数在统计学中主要用于将一组观测值转换为更方便的分布,特别是使偏斜分布更接近正态分布。数量 的对数(log)是值 ,满足 。这里的 e 是常数 2.718281…。1 的对数为 0,0 的对数为负无穷 。对数变换仅适用于所有值均为正的数据。 称为以 e 为底的自然对数,有时写作 。我们有时使用以 10 为底的对数,此时 是满足 的值 。使用以 10 为底的对数的优点是数字 10、100、1000 等变为 1、2、3 等。然而,以 e 为底的对数更常用,且可能是计算机软件中唯一可用的选项。不同底数的对数转换效果无差别,只是数值相差一个常数倍。若以对数单位报告数值,明确底数非常重要。必要时我们使用括号以明确含义,如
Logarithms are mainly used in statistics to transform a set of observations to values with a more convenient distribution, in particular to make a skewed distribution closer to a Normal distribution. The logarithm (log) of a quantity is the value such that . Here e is the constant 2.718281. . . . The log of 1 is 0 and the log of 0 is minus infinity . Log transformation can be used only for data where all values are positive. is known as the natural logarithm of to the base e, and is sometimes written . We sometimes use logarithms to the base 10, in which case is the value such that . The advantage of using logs to base 10 is that the numbers 10, 100, 1000, etc. become 1, 2, 3, etc. However, the use of logs to base e is much more common, and may be the only option in a computer package. There is no difference in the effect of taking logs to different bases; one gives values that are a constant multiple of the other. It is, however, important to clarify the base used if values are quoted in log units. We use brackets when necessary to make the meaning clear, as in .

两个数量(例如 \(x\) 和 \(y\))的比值的对数等于它们对数的差,即 \(\log(x/y) = \log x - \log y\)。
The logarithm of the ratio of two quantities, say \(x\) and \(y\), is equal to the difference between their logarithms, i.e. \(\log(x/y) = \log x - \log y\).
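
The properties of logarithms described above can be verified with a brief Python sketch. The values of `x` and `y` are hypothetical; `math.log` gives logarithms to base e and `math.log10` gives logarithms to base 10.

```python
import math

# Hypothetical positive values, used only for illustration.
x, y = 200.0, 25.0

ln_x = math.log(x)         # natural logarithm (base e)
log10_x = math.log10(x)    # logarithm to base 10

# Logs to different bases differ only by a constant multiple, here ln(10).
multiple = ln_x / log10_x

# The log of a ratio equals the difference between the logs.
log_ratio = math.log(x / y)
log_difference = math.log(x) - math.log(y)

# The exponential function reverses the natural logarithm.
recovered = math.exp(ln_x)

print(round(multiple, 4), round(log_ratio, 4), round(log_difference, 4))
```

Because the two bases differ by a constant factor, an analysis on the log scale leads to the same conclusions whichever base is used, provided the base is reported.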

A1.5 符号词汇表 A1.5 GLOSSARY OF NOTATION

本书中使用的符号简要说明如下,另附一些未出现但常见的符号。
The notation used in this book is briefly described below, along with a few items that do not appear but which may often be encountered.

表示第 i 组受试者的样本量。
or The sample size in the ith group of subjects.

n 或 N 表示总样本量。
n or N The total sample size.

一组观测值的样本均值,个别观测值用 表示;读作“x bar”(x 横线)。在某些章节中,观测值用其他字母表示,如 ,此时均值分别表示为
The mean of a sample of observations, where the individual observations are denoted by or ; it is pronounced 'x bar'. In some chapters observations are denoted by other letters such as or , in which case the mean is or .

\(\mu\) 希腊字母 mu,表示总体均值。
\(\mu\) The Greek letter mu, denoting the mean of a population.

\(s\) 样本观测值的标准差。它衡量观测值围绕均值的变异程度。
\(s\) The standard deviation of a sample of observations. It is a measure of their variability around the mean.

\(\sigma\) 希腊字母 sigma,表示总体的标准差。
\(\sigma\) The Greek letter sigma, denoting the standard deviation of a population.

样本均值或其他估计统计量的标准误差。它衡量该估计值的不确定性,并用于推导总体值的置信区间。符号 表示“ 的标准误差”。
The standard error of a sample mean or some other estimated statistic. It is a measure of the uncertainty of such an estimate and is used to derive a confidence interval for the population value. The notation or means 'the standard error of '.

具有某一特征的样本比例。总体中具有该特征的比例也可用 表示,此时样本比例用 表示。
The proportion of a sample with a given characteristic. The proportion of the population with a given characteristic may also be called , in which case the sample proportion is denoted .

\(\Sigma\) 希腊大写字母 sigma,表示“求和”。详见 A1.2.7 节。
\(\Sigma\) The Greek capital letter sigma, denoting 'sum of'. See section A1.2.7.

\(\Pi\) 希腊大写字母 pi,表示“连乘积”。详见 A1.2.8 节。
\(\Pi\) The Greek capital letter pi, denoting 'product of'. See section A1.2.8.

以自然常数 e 为底的 的自然对数,也写作 。有时我们使用以 10 为底的对数,写作 。对数的相关内容见 A1.4.1 节。另见
The natural logarithm of to the base e, also written . We sometimes use logarithms to the base 10, written . Logarithms are explained in section A1.4.1. See also .

指数函数,表示对数运算的逆过程,有时称为反对数变换。另一种表示法为
The exponential function, denoting the inverse procedure to taking logarithms. It is sometimes called the antilogarithmic transformation. An alternative notation is .

\(\alpha\) (a) 假设检验的显著性水平;\(100(1 - \alpha)\%\) 是置信区间的置信水平。要在给定的 \(\alpha\) 水平下进行检验,我们将检验统计量与相应抽样分布中截取比例为 \(\alpha\) 的理论值进行比较。最常用的 \(\alpha\) 值为0.05或0.01。传统上,假设检验的 \(P\) 值(见下文)与 \(\alpha\) 比较,当 \(P < \alpha\) 时,检验被认为是“显著”的。现代观点是报告 \(P\) 值,而不将检验结果简单地视为显著与否的决定。\(\alpha\) 也称为第一类错误率。参见第8.5节。
\(\alpha\) (a) The level of a hypothesis test; \(100(1 - \alpha)\%\) is the level of the confidence interval. To perform a test at a given level of \(\alpha\), we compare the test statistic with the theoretical value of the appropriate sampling distribution which cuts off a proportion \(\alpha\) of the distribution. Most commonly we use \(\alpha\) equal to 0.05 or 0.01. Traditionally, the \(P\) value (see below) from a hypothesis test is compared with \(\alpha\) and the test is 'significant' if \(P < \alpha\). The modern attitude is to present the \(P\) value and not to consider the test as being a decision about whether or not the result is significant. \(\alpha\) is also known as the Type I error rate. See section 8.5.

(b) 总体回归线的截距。样本截距用 \(a\) 表示。
(b) The intercept of a regression line in the population. The sample intercept is denoted \(a\).

\(\beta\) (a) 与假设检验相关的第二类错误率。假设检验的检验力为 \(1 - \beta\),即当备择假设成立时,\(P\) 值低于预设显著性水平 \(\alpha\) 的概率。参见第8.5节。
\(\beta\) (a) The Type II error rate associated with a hypothesis test. The power of a hypothesis test is \(1 - \beta\), and is the probability that the \(P\) value will be lower than the prespecified significance level \(\alpha\) when the alternative hypothesis is true. See section 8.5.

(b) 总体回归线的斜率。样本斜率用 \(b\) 表示。
(b) The slope of a regression line in the population. The sample slope is denoted \(b\).

P
P

假设检验中的概率值或显著性水平。 是在原假设成立时,数据(或更极端的数据)仅因抽样变异而出现的概率。使用 更合适,避免与观察比例 混淆。
The probability value, or significance level, from a hypothesis test. is the probability of the data (or some more extreme data) arising by chance - that is, due to sampling variation only - when the null hypothesis is true. It is better to use rather than which can be confused with an observed proportion.

来自标准正态分布的值,该分布均值为0,标准差为1。下标表示该值以下的分布比例 。例如, 是标准正态分布中下方包含97.5%数据的值。比如,。分布的中间 区间位于 之间。由于正态分布的对称性,,因此中间的 区间在 之间。例如,正态分布的中间95%位于 之间,即 。本书中常用的另一种记法是 ,表示样本计算得到的检验统计量值。
A value from the standard Normal distribution, which is the theoretical Normal distribution with mean 0 and standard deviation 1. The subscript represents the proportion of the distribution below the value .Thus is the value of the standard Normal distribution below which lies the bottom 0.975 or of the distribution. Thus, for example, The central or of the distribution lies between and . Because of the symmetry of the Normal distribution, so that the central of the distribution lies between and . For example, the central of the Normal distribution lies between and , that is, between and . A common alternative notation is , used in this book for the value of the test statistic derived from a sample.
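
Values of the standard Normal distribution can be obtained from Python's standard library rather than from tables. The sketch below is illustrative only; it uses `statistics.NormalDist`, and the variable names are not taken from the book.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Value below which lies 0.975 (97.5%) of the distribution.
z_upper = std_normal.inv_cdf(0.975)

# By symmetry, the value below which lies 0.025 is the negative of z_upper.
z_lower = std_normal.inv_cdf(0.025)

# Proportion of the distribution between the two values (the central 95%).
central = std_normal.cdf(z_upper) - std_normal.cdf(z_lower)

print(round(z_upper, 2), round(z_lower, 2), round(central, 2))
```

The same approach gives the limits for other central proportions: for a central proportion of 0.90, for instance, the upper value would be `std_normal.inv_cdf(0.95)`.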

来自“Student” 分布,该分布是小样本均值的采样分布。我们用 表示样本计算值,用 表示理论分布中的相应值,其中 是自由度数。
A value from 'Student's' distribution, the sampling distribution for means of small samples. We use for the value derived from a sample, and to indicate the appropriate value from the theoretical distribution, where denotes the number of degrees of freedom.

来自“卡方”分布,是基于频数表的检验统计量的采样分布。我们用 表示样本计算值,用 表示理论分布中的相应值,其中 是自由度数。
A value from the 'Chi squared' distribution, the sampling distribution for test statistics derived from tables of frequencies. We use or for the value derived from a sample, and to indicate the appropriate value from the theoretical distribution, where denotes the number of degrees of freedom.

是样本计算的Pearson相关系数。总体相关系数用希腊字母 表示。详见第11章。
The Pearson correlation coefficient calculated from a sample. The population correlation coefficient is denoted by the Greek letter rho . See Chapter 11.

从样本计算得出的斯皮尔曼等级相关系数。详见第11章。
The Spearman rank correlation coefficient calculated from a sample. See Chapter 11.

来自“”分布的一个值,该分布是两个方差比值的抽样分布。也用来表示两个方差比值的样本值。
A value from the ' ' distribution, the sampling distribution for the ratio of two variances. is also used for the sample value of the ratio of two variances.

对应二维图中某一点的数值,如散点图,有时称为“坐标”。变量的一组值的均值记作
The values corresponding to a point in a two- dimensional graph, such as a scatter diagram, sometimes called 'coordinates'. The mean of a set of values of the variables and is denoted .

\(\infty\) 无穷大,比任何可想象的数都大的值。同理,\(-\infty\) 是比任何可想象的负数都小的值。\(\pm\infty\) 是表示标准正态分布值的水平刻度的极限。
\(\infty\) Infinity, the value larger than any imaginable number. Likewise \(-\infty\) is the value less than any imaginable negative number. The values \(\pm\infty\) are the extremities of a horizontal scale representing the values of a standard Normal distribution.

阶乘,例如。详见附录A1.2.9。
Factorial, as in . See section A1.2.9.

附录 B Appendix B

表 B6 F 分布 Table B6 The F distribution

表中列出的 \(F\) 分布值对应于不同分子自由度 \((n_1)\) 和分母自由度 \((n_2)\) 下给定的单尾 \(P\) 值。对于检验统计量服从 \(F\) 分布的单侧假设检验,如果检验统计量大于表中的 \(F\) 值,则 \(P\) 值小于表中对应的 \(P\) 值。
The tabulated values of the \(F\) distribution correspond to given one-tailed \(P\) values for different degrees of freedom for the numerator \((n_1)\) and denominator \((n_2)\). For the one-sided hypothesis test where the test statistic has an \(F\) distribution, the \(P\) value is less than a tabulated value of \(P\) if the test statistic is greater than the tabulated value of \(F\).

示例:当观察到的检验统计量为 ,自由度为 2 和 20 时,有
Example: For an observed test statistic on 2 and 20 degrees of freedom we have .
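
The way the table is read can be mimicked in a short Python sketch. It relies on an assumption that goes beyond the text: when the numerator has exactly 2 degrees of freedom, the upper tail area of the F distribution has the closed form \((1 + 2F/n_2)^{-n_2/2}\). This shortcut does not hold for other numerator degrees of freedom, and the observed value used below is hypothetical.

```python
def f_upper_tail_n1_2(f_value: float, n2: float) -> float:
    """One-tailed P value for an F statistic on 2 and n2 degrees of freedom.

    Closed form valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_value / n2) ** (-n2 / 2.0)


def f_critical_n1_2(p: float, n2: float) -> float:
    """Value of F on 2 and n2 degrees of freedom cutting off an upper tail p."""
    return (n2 / 2.0) * (p ** (-2.0 / n2) - 1.0)


# Tabulated values for n1 = 2, n2 = 20 at P = 0.1, 0.05 and 0.01.
critical_values = {p: round(f_critical_n1_2(p, 20), 2) for p in (0.1, 0.05, 0.01)}

# Hypothetical observed test statistic on 2 and 20 degrees of freedom.
observed_f = 4.2
p_value = f_upper_tail_n1_2(observed_f, 20)

print(critical_values, round(p_value, 3))
```

The hypothetical statistic lies between the tabulated values for P = 0.05 and P = 0.01, so its P value lies between 0.01 and 0.05, which is how the printed table would be used.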

表格 B6
Table B6

n2P1234567n1891011121520
10.139.949.553.655.857.258.259.159.760.160.561.061.562.063.3
0.05161.4199.5215.8224.7230.4234.2237.0239.1240.8242.1244.2246.2248.3254.3
0.014051.84999.55403.55624.85763.85859.25928.65981.36022.76056.16106.66157.66209.06365.9
20.18.539.009.169.249.299.339.359.379.389.399.419.439.449.49
0.0518.5119.0019.1619.2519.3019.3319.3519.3719.3819.4019.4119.4319.4519.50
0.0198.5099.0099.1799.2599.3099.3399.3699.7599.7899.8099.8399.8799.9099.50
30.15.545.465.395.345.315.285.275.255.245.235.225.185.13
0.0510.139.559.289.129.018.948.898.858.818.798.748.708.668.53
0.0134.1130.8229.4628.7128.2427.9127.6727.4927.3427.2327.0526.8726.6926.13
40.14.544.324.194.114.054.013.983.953.943.923.903.873.843.76
0.057.716.946.596.396.266.166.096.046.005.965.915.865.805.63
0.0121.2018.0016.6915.9815.5215.2114.9814.8014.6614.5514.3714.2014.0213.46
50.14.063.783.623.523.453.403.373.343.323.303.273.243.213.10
0.056.615.795.415.195.054.954.884.824.774.744.684.624.564.36
0.0116.2613.2712.0611.3910.9710.6710.4610.2910.1610.059.899.729.559.02
60.13.783.463.293.183.113.053.012.982.962.942.902.872.72
0.055.995.144.764.534.394.284.214.154.104.064.003.943.873.67
0.0113.7410.929.789.158.758.478.268.107.987.877.727.567.406.88

表格 B6(续)
Table B6(cont.)

| $n_2$ | $P$ | $n_1$ = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 15 | 20 | ∞ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.1 | 2.97 | 2.59 | 2.38 | 2.25 | 2.16 | 2.09 | 2.04 | 2.00 | 1.96 | 1.94 | 1.89 | 1.84 | 1.79 | 1.61 |
| | 0.05 | 4.35 | 3.49 | 3.10 | 2.87 | 2.71 | 2.60 | 2.51 | 2.45 | 2.39 | 2.35 | 2.28 | 2.20 | 2.12 | 1.84 |
| | 0.01 | 8.10 | 5.85 | 4.94 | 4.43 | 4.10 | 3.87 | 3.70 | 3.56 | 3.46 | 3.37 | 3.23 | 3.09 | 2.94 | 2.42 |
| 30 | 0.1 | 2.88 | 2.49 | 2.28 | 2.14 | 2.05 | 1.98 | 1.93 | 1.88 | 1.85 | 1.82 | 1.77 | 1.72 | 1.67 | 1.46 |
| | 0.05 | 4.17 | 3.32 | 2.92 | 2.69 | 2.53 | 2.42 | 2.33 | 2.27 | 2.21 | 2.16 | 2.09 | 2.01 | 1.93 | 1.62 |
| | 0.01 | 7.56 | 5.39 | 4.51 | 4.02 | 3.70 | 3.47 | 3.30 | 3.17 | 3.07 | 2.98 | 2.84 | 2.70 | 2.55 | 2.01 |
| 40 | 0.1 | 2.84 | 2.44 | 2.23 | 2.09 | 2.00 | 1.93 | 1.87 | 1.83 | 1.79 | 1.76 | 1.71 | 1.66 | 1.61 | 1.38 |
| | 0.05 | 4.08 | 3.23 | 2.84 | 2.61 | 2.45 | 2.34 | 2.25 | 2.18 | 2.12 | 2.08 | 2.00 | 1.92 | 1.84 | 1.51 |
| | 0.01 | 7.31 | 5.18 | 4.31 | 3.83 | 3.51 | 3.29 | 3.12 | 2.99 | 2.89 | 2.80 | 2.66 | 2.52 | 2.37 | 1.80 |
| 60 | 0.1 | 2.79 | 2.39 | 2.18 | 2.04 | 1.95 | 1.87 | 1.82 | 1.77 | 1.74 | 1.71 | 1.66 | 1.60 | 1.54 | 1.29 |
| | 0.05 | 4.00 | 3.15 | 2.76 | 2.53 | 2.37 | 2.25 | 2.17 | 2.10 | 2.04 | 1.99 | 1.92 | 1.84 | 1.75 | 1.39 |
| | 0.01 | 7.08 | 4.98 | 4.13 | 3.65 | 3.34 | 3.12 | 2.95 | 2.82 | 2.72 | 2.63 | 2.50 | 2.35 | 2.20 | 1.60 |
| 120 | 0.1 | 2.75 | 2.35 | 2.13 | 1.99 | 1.90 | 1.82 | 1.77 | 1.72 | 1.68 | 1.65 | 1.60 | 1.54 | 1.48 | 1.19 |
| | 0.05 | 3.92 | 3.07 | 2.68 | 2.45 | 2.29 | 2.18 | 2.09 | 2.02 | 1.96 | 1.91 | 1.83 | 1.75 | 1.66 | 1.25 |
| | 0.01 | 6.85 | 4.79 | 3.95 | 3.48 | 3.17 | 2.96 | 2.79 | 2.66 | 2.56 | 2.47 | 2.34 | 2.19 | 2.03 | 1.38 |
| ∞ | 0.1 | 2.71 | 2.30 | 2.08 | 1.94 | 1.85 | 1.77 | 1.72 | 1.67 | 1.63 | 1.60 | 1.55 | 1.49 | 1.42 | 1.13 |
| | 0.05 | 3.84 | 3.00 | 2.60 | 2.37 | 2.21 | 2.10 | 2.01 | 1.94 | 1.88 | 1.83 | 1.75 | 1.67 | 1.57 | 1.17 |
| | 0.01 | 6.63 | 4.61 | 3.78 | 3.32 | 3.02 | 2.80 | 2.64 | 2.51 | 2.41 | 2.32 | 2.18 | 2.04 | 1.88 | 1.24 |

Table B7 Pearson's correlation coefficient ($r$)

If the observed correlation coefficient, $r$, exceeds the tabulated value, the associated two-tailed $P$ value is less than the value at the top of the column. For negative values of $r$, ignore the sign. For partial correlation coefficients, reduce the sample size by 1.

例子:对于样本量为39时观察到的 ,双尾 值为
Example: For an observed value of from a sample of 39, the two- tailed value is

| Sample size | P = 0.2 | P = 0.1 | P = 0.05 | P = 0.02 | P = 0.01 | P = 0.001 |
|---|---|---|---|---|---|---|
| 3 | 0.9511 | 0.9877 | 0.9969 | 0.9995 | 0.9999 | 1.0000 |
| 4 | 0.8000 | 0.9000 | 0.9500 | 0.9800 | 0.9900 | 0.9990 |
| 5 | 0.6870 | 0.8054 | 0.8783 | 0.9343 | 0.9587 | 0.9911 |
| 6 | 0.6084 | 0.7293 | 0.8114 | 0.8822 | 0.9172 | 0.9741 |
| 7 | 0.5509 | 0.6694 | 0.7545 | 0.8329 | 0.8745 | 0.9509 |
| 8 | 0.5067 | 0.6215 | 0.7067 | 0.7887 | 0.8343 | 0.9249 |
| 9 | 0.4716 | 0.5822 | 0.6664 | 0.7498 | 0.7977 | 0.8983 |
| 10 | 0.4428 | 0.5494 | 0.6319 | 0.7155 | 0.7646 | 0.8721 |
| 11 | 0.4187 | 0.5214 | 0.6021 | 0.6851 | 0.7348 | 0.8470 |
| 12 | 0.3981 | 0.4973 | 0.5760 | 0.6581 | 0.7079 | 0.8233 |
| 13 | 0.3802 | 0.4762 | 0.5529 | 0.6339 | 0.6835 | 0.8010 |
| 14 | 0.3646 | 0.4575 | 0.5324 | 0.6120 | 0.6614 | 0.7800 |
| 15 | 0.3507 | 0.4409 | 0.5140 | 0.5923 | 0.6411 | 0.7604 |
| 16 | 0.3383 | 0.4259 | 0.4973 | 0.5742 | 0.6226 | 0.7419 |
| 17 | 0.3271 | 0.4124 | 0.4821 | 0.5577 | 0.6055 | 0.7247 |
| 18 | 0.3170 | 0.4000 | 0.4683 | 0.5425 | 0.5897 | 0.7084 |
| 19 | 0.3077 | 0.3887 | 0.4555 | 0.5285 | 0.5751 | 0.6932 |
| 20 | 0.2992 | 0.3783 | 0.4438 | 0.5155 | 0.5614 | 0.6788 |
| 21 | 0.2914 | 0.3687 | 0.4329 | 0.5034 | 0.5487 | 0.6652 |
| 22 | 0.2841 | 0.3598 | 0.4227 | 0.4921 | 0.5368 | 0.6524 |
| 23 | 0.2774 | 0.3515 | 0.4132 | 0.4815 | 0.5256 | 0.6402 |
| 24 | 0.2711 | 0.3438 | 0.4044 | 0.4716 | 0.5151 | 0.6287 |
| 25 | 0.2653 | 0.3365 | 0.3961 | 0.4622 | 0.5052 | 0.6178 |
| 26 | 0.2598 | 0.3297 | 0.3882 | 0.4534 | 0.4958 | 0.6074 |
| 27 | 0.2546 | 0.3233 | 0.3809 | 0.4451 | 0.4869 | 0.5974 |
| 28 | 0.2497 | 0.3172 | 0.3739 | 0.4372 | 0.4785 | 0.5880 |
| 29 | 0.2451 | 0.3115 | 0.3673 | 0.4297 | 0.4705 | 0.5790 |
| 30 | 0.2407 | 0.3061 | 0.3610 | 0.4226 | 0.4629 | 0.5703 |
| 31 | 0.2366 | 0.3009 | 0.3550 | 0.4158 | 0.4556 | 0.5620 |
| 32 | 0.2327 | 0.2960 | 0.3494 | 0.4093 | 0.4487 | 0.5541 |
| 33 | 0.2289 | 0.2913 | 0.3440 | 0.4032 | 0.4421 | 0.5465 |
| 34 | 0.2254 | 0.2869 | 0.3388 | 0.3972 | 0.4357 | 0.5392 |
| 35 | 0.2220 | 0.2826 | 0.3338 | 0.3916 | 0.4296 | 0.5322 |
| 36 | 0.2187 | 0.2785 | 0.3291 | 0.3862 | 0.4238 | 0.5254 |
| 37 | 0.2156 | 0.2746 | 0.3246 | 0.3810 | 0.4182 | 0.5189 |
| 38 | 0.2126 | 0.2709 | 0.3202 | 0.3760 | 0.4128 | 0.5126 |
| 39 | 0.2097 | 0.2673 | 0.3160 | 0.3712 | 0.4076 | 0.5066 |
| 40 | 0.2070 | 0.2638 | 0.3120 | 0.3665 | 0.4026 | 0.5007 |
| 41 | 0.2043 | 0.2605 | 0.3081 | 0.3621 | 0.3978 | 0.4950 |
| 42 | 0.2018 | 0.2573 | 0.3044 | 0.3578 | 0.3932 | 0.4896 |
| 43 | 0.1993 | 0.2542 | 0.3008 | 0.3536 | 0.3887 | 0.4843 |
| 44 | 0.1970 | 0.2512 | 0.2973 | 0.3496 | 0.3843 | 0.4791 |
| 45 | 0.1947 | 0.2483 | 0.2940 | 0.3457 | 0.3801 | 0.4742 |
| 46 | 0.1925 | 0.2455 | 0.2907 | 0.3420 | 0.3761 | 0.4694 |
| 47 | 0.1903 | 0.2429 | 0.2876 | 0.3384 | 0.3721 | 0.4647 |
| 48 | 0.1883 | 0.2403 | 0.2845 | 0.3348 | 0.3683 | 0.4601 |
| 49 | 0.1863 | 0.2377 | 0.2816 | 0.3314 | 0.3646 | 0.4557 |
| 50 | 0.1843 | 0.2353 | 0.2787 | 0.3281 | 0.3610 | 0.4514 |
| 51 | 0.1825 | 0.2329 | 0.2759 | 0.3249 | 0.3575 | 0.4473 |
| 52 | 0.1806 | 0.2306 | 0.2732 | 0.3218 | 0.3542 | 0.4432 |
| 53 | 0.1789 | 0.2284 | 0.2706 | 0.3188 | 0.3509 | 0.4393 |
| 54 | 0.1772 | 0.2262 | 0.2681 | 0.3158 | 0.3477 | 0.4354 |
| 55 | 0.1755 | 0.2241 | 0.2656 | 0.3129 | 0.3445 | 0.4317 |
| 56 | 0.1739 | 0.2221 | 0.2632 | 0.3102 | 0.3415 | 0.4280 |
| 57 | 0.1723 | 0.2201 | 0.2609 | 0.3074 | 0.3385 | 0.4244 |
| 58 | 0.1708 | 0.2181 | 0.2586 | 0.3048 | 0.3357 | 0.4210 |
| 59 | 0.1693 | 0.2162 | 0.2564 | 0.3022 | 0.3328 | 0.4176 |
| 60 | 0.1678 | 0.2144 | 0.2542 | 0.2997 | 0.3301 | 0.4143 |
| 70 | 0.1550 | 0.1982 | 0.2352 | 0.2776 | 0.3060 | 0.3850 |
| 80 | 0.1448 | 0.1852 | 0.2199 | 0.2597 | 0.2864 | 0.3611 |
| 90 | 0.1364 | 0.1745 | 0.2072 | 0.2449 | 0.2702 | 0.3412 |
| 100 | 0.1292 | 0.1654 | 0.1966 | 0.2324 | 0.2565 | 0.3242 |
| 110 | 0.1231 | 0.1576 | 0.1874 | 0.2216 | 0.2446 | 0.3095 |
| 120 | 0.1178 | 0.1509 | 0.1793 | 0.2122 | 0.2343 | 0.2967 |
| 130 | 0.1131 | 0.1449 | 0.1723 | 0.2039 | 0.2252 | 0.2853 |
| 140 | 0.1090 | 0.1396 | 0.1660 | 0.1965 | 0.2170 | 0.2752 |
| 150 | 0.1052 | 0.1348 | 0.1603 | 0.1898 | 0.2097 | 0.2660 |

For $n > 150$, $r\sqrt{(n-2)/(1-r^2)}$ has a $t$ distribution with $n - 2$ degrees of freedom.
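For samples beyond the range of the table, the conversion to a $t$ statistic can be carried out directly. The following is a minimal sketch under stated assumptions: the function name `correlation_t_statistic` is hypothetical, and it uses the standard identity $t = r\sqrt{(n-2)/(1-r^2)}$ on $n - 2$ degrees of freedom. The result would still be referred to a table of the $t$ distribution.

```python
import math
from typing import Tuple


def correlation_t_statistic(r: float, n: int) -> Tuple[float, int]:
    """Return (t, df) for a correlation coefficient r observed on n pairs.

    Hypothetical helper: t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2.
    """
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    return t, df


# Illustrative values only (not taken from the book).
t, df = correlation_t_statistic(0.25, 200)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

The sign of $t$ follows the sign of $r$; for a two-tailed test the sign is ignored, as with the table.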

Table B8 Spearman's rank correlation coefficient

If the observed rank correlation coefficient, $r_s$, exceeds the tabulated value, the associated two-tailed $P$ value is less than the value at the top of the column. For negative values of $r_s$, ignore the sign. For partial correlation coefficients, reduce the sample size by 1.

举例:对于样本量为 19 的观测值 ,双尾 值为
Example: For an observed value of from a sample of 19, the two- tailed value is

| Sample size | P = 0.2 | P = 0.1 | P = 0.05 | P = 0.02 | P = 0.01 | P = 0.002 |
|---|---|---|---|---|---|---|
| 4 | 0.8000 | 0.8000 | | | | |
| 5 | 0.7000 | 0.8000 | 0.9000 | 0.9000 | | |
| 6 | 0.6000 | 0.7714 | 0.8286 | 0.8857 | 0.9429 | |
| 7 | 0.5357 | 0.6786 | 0.7450 | 0.8571 | 0.8929 | 0.9643 |
| 8 | 0.5000 | 0.6190 | 0.7143 | 0.8095 | 0.8571 | 0.9286 |
| 9 | 0.4667 | 0.5833 | 0.6833 | 0.7667 | 0.8167 | 0.9000 |
| 10 | 0.4424 | 0.5515 | 0.6364 | 0.7333 | 0.7818 | 0.8667 |
| 11 | 0.4182 | 0.5273 | 0.6091 | 0.7000 | 0.7455 | 0.8364 |
| 12 | 0.3986 | 0.4965 | 0.5804 | 0.6713 | 0.7273 | 0.8182 |
| 13 | 0.3791 | 0.4780 | 0.5549 | 0.6429 | 0.6978 | 0.7912 |
| 14 | 0.3626 | 0.4593 | 0.5341 | 0.6220 | 0.6747 | 0.7670 |
| 15 | 0.3500 | 0.4429 | 0.5179 | 0.6000 | 0.6536 | 0.7464 |
| 16 | 0.3382 | 0.4265 | 0.5000 | 0.5824 | 0.6324 | 0.7265 |
| 17 | 0.3260 | 0.4118 | 0.4853 | 0.5637 | 0.6152 | 0.7083 |
| 18 | 0.3148 | 0.3994 | 0.4716 | 0.5480 | 0.5975 | 0.6904 |
| 19 | 0.3070 | 0.3895 | 0.4579 | 0.5333 | 0.5825 | 0.6737 |
| 20 | 0.2977 | 0.3789 | 0.4451 | 0.5203 | 0.5684 | 0.6586 |
| 21 | 0.2909 | 0.3688 | 0.4351 | 0.5078 | 0.5545 | 0.6455 |
| 22 | 0.2829 | 0.3597 | 0.4241 | 0.4963 | 0.5426 | 0.6318 |
| 23 | 0.2767 | 0.3518 | 0.4150 | 0.4852 | 0.5306 | 0.6186 |
| 24 | 0.2704 | 0.3435 | 0.4061 | 0.4748 | 0.5200 | 0.6070 |
| 25 | 0.2646 | 0.3362 | 0.3977 | 0.4654 | 0.5100 | 0.5962 |
| 26 | 0.2588 | 0.3299 | 0.3894 | 0.4564 | 0.5002 | 0.5856 |
| 27 | 0.2540 | 0.3236 | 0.3822 | 0.4481 | 0.4915 | 0.5757 |
| 28 | 0.2490 | 0.3175 | 0.3749 | 0.4401 | 0.4828 | 0.5660 |
| 29 | 0.2443 | 0.3113 | 0.3685 | 0.4320 | 0.4744 | 0.5567 |
| 30 | 0.2400 | 0.3059 | 0.3620 | 0.4251 | 0.4665 | 0.5479 |

For $n > 30$, $r_s\sqrt{(n-2)/(1-r_s^2)}$ has a $t$ distribution with $n - 2$ degrees of freedom.

Table B9 Wilcoxon one sample (or matched pairs) test

If the sum of positive (or negative) ranks is equal to the tabulated values or is outside the range shown (i.e. is not between the tabulated values), the $P$ value of the test is less than the value at the top of the column. The sample size shown, $n$, is the number of non-zero differences.

Example: For a rank sum of 8 from a sample of 11 we look along the row for $n = 11$ from the left until we find the last column where the value 8 is not between the tabulated values, which gives $P < 0.05$. A rank sum of 7 would give $P < 0.02$.

| $n$ | P = 0.2 | P = 0.1 | P = 0.05 | P = 0.02 | P = 0.01 | P = 0.001 |
|---|---|---|---|---|---|---|
| 4 | 0-10 | - | - | - | - | - |
| 5 | 2-13 | 0-15 | - | - | - | - |
| 6 | 3-18 | 2-19 | 0-21 | - | - | - |
| 7 | 5-23 | 3-25 | 2-26 | 0-28 | - | - |
| 8 | 8-28 | 5-31 | 3-33 | 1-35 | 0-36 | - |
| 9 | 10-35 | 8-37 | 5-40 | 3-42 | 1-44 | - |
| 10 | 14-41 | 10-45 | 8-47 | 5-50 | 3-52 | - |
| 11 | 17-49 | 13-53 | 10-56 | 7-59 | 5-61 | 0-66 |
| 12 | 21-57 | 17-61 | 13-65 | 9-69 | 7-71 | 1-77 |
| 13 | 26-65 | 21-70 | 17-74 | 12-79 | 9-82 | 2-89 |
| 14 | 31-74 | 25-80 | 21-84 | 15-90 | 12-93 | 4-101 |
| 15 | 36-84 | 30-90 | 25-95 | 19-101 | 15-105 | 6-114 |
| 16 | 42-94 | 35-101 | 29-107 | 23-113 | 19-117 | 9-127 |
| 17 | 48-105 | 41-112 | 34-119 | 28-125 | 23-130 | 11-142 |
| 18 | 55-116 | 47-124 | 40-131 | 32-139 | 27-144 | 14-157 |
| 19 | 62-128 | 53-137 | 46-144 | 37-153 | 32-158 | 18-172 |
| 20 | 69-141 | 60-150 | 52-158 | 43-167 | 37-173 | 21-189 |
| 21 | 77-154 | 67-164 | 58-173 | 49-182 | 42-189 | 26-205 |
| 22 | 86-167 | 75-178 | 66-187 | 55-198 | 48-205 | 30-223 |
| 23 | 95-181 | 83-193 | 73-203 | 62-214 | 54-222 | 35-241 |
| 24 | 104-196 | 91-209 | 81-219 | 69-231 | 61-239 | 40-260 |
| 25 | 114-211 | 100-225 | 89-236 | 76-249 | 68-257 | 45-280 |
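The look-up rule for this table can be written as a short procedure. The sketch below is illustrative only: `p_upper_bound`, `P_COLUMNS` and `ROW_N11` are hypothetical names, and the row is copied from the $n = 11$ line of the table.

```python
from typing import Optional, Sequence, Tuple

# Two-tailed probabilities heading the columns of Table B9.
P_COLUMNS = (0.2, 0.1, 0.05, 0.02, 0.01, 0.001)

# Row for n = 11, copied from the table: (lower, upper) limits per column.
ROW_N11 = ((17, 49), (13, 53), (10, 56), (7, 59), (5, 61), (0, 66))


def p_upper_bound(rank_sum: float,
                  row: Sequence[Optional[Tuple[int, int]]]) -> Optional[float]:
    """Return the smallest column heading P such that the test gives
    'P less than that value', or None if even the first column fails."""
    bound = None
    for p, limits in zip(P_COLUMNS, row):
        if limits is None:  # a '-' entry: this level cannot be reached
            break
        lower, upper = limits
        # Counts if equal to a tabulated value or outside the range shown.
        if rank_sum <= lower or rank_sum >= upper:
            bound = p
        else:
            break
    return bound


print(p_upper_bound(8, ROW_N11))  # 0.05, i.e. P < 0.05
print(p_upper_bound(7, ROW_N11))  # 0.02, i.e. P < 0.02
```

Rows containing `-` entries can be represented with `None` in the corresponding positions.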

For $n > 25$ the rank sum $T$ has an approximately Normal distribution with mean $\mu_T = n(n+1)/4$ and standard deviation $\sigma_T = \sqrt{n(n+1)(2n+1)/24}$.

The test statistic $z = (T - \mu_T)/\sigma_T$ is evaluated using Table B2.
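The large-sample calculation can be sketched as follows. This is an illustration rather than the book's own procedure: the function name is hypothetical, the mean $n(n+1)/4$ and standard deviation $\sqrt{n(n+1)(2n+1)/24}$ are the usual large-sample values for the Wilcoxon signed rank sum, no continuity or tie correction is applied, and the two-tailed $P$ value is taken from the Normal distribution via `math.erfc` instead of a printed table.

```python
import math
from typing import Tuple


def wilcoxon_normal_approx(rank_sum: float, n: int) -> Tuple[float, float]:
    """Return (z, two-tailed P) for a signed rank sum from n non-zero
    differences, using the Normal approximation (hypothetical helper)."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (rank_sum - mean) / sd
    # 2 * (1 - Phi(|z|)) expressed through the complementary error function.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


# Illustrative values only: rank sum 120 from 30 non-zero differences.
z, p = wilcoxon_normal_approx(120, 30)
print(f"z = {z:.2f}, two-tailed P = {p:.3f}")
```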

Table B10 The Mann-Whitney test (Wilcoxon two sample test)

For a comparison of two groups of size $n_1$ and $n_2$, where $n_1 \le n_2$, if the sum of the ranks in the smaller group is equal to the tabulated values or is outside the range shown (i.e. is not between the tabulated values), the $P$ value of the test is less than the value at the top of the column.

Example: For a rank sum of 29 from a sample of 6 compared with a sample of 11 we look along the row for $n_1 = 6$ and $n_2 = 11$ from the left until we find the last column where the value 29 is not between the tabulated values, which gives $P < 0.02$. A rank sum of 28 would give $P < 0.01$.

| $n_1$ | $n_2$ | P = 0.1 | P = 0.05 | P = 0.02 | P = 0.01 | P = 0.001 |
|---|---|---|---|---|---|---|
| 3 | 3 | 6-15 | - | - | - | - |
| 3 | 4 | 6-18 | - | - | - | - |
| 4 | 4 | 11-25 | 10-26 | - | - | - |
| 2 | 5 | 3-13 | - | - | - | - |
| 3 | 5 | 7-20 | 6-21 | - | - | - |
| 4 | 5 | 12-28 | 11-29 | 10-30 | - | - |
| 5 | 5 | 19-36 | 17-38 | 16-39 | 15-40 | - |
| 2 | 6 | 3-15 | - | - | - | - |
| 3 | 6 | 8-22 | 7-23 | - | - | - |
| 4 | 6 | 13-31 | 12-32 | 11-33 | 10-34 | - |
| 5 | 6 | 20-40 | 18-42 | 17-43 | 16-44 | - |
| 6 | 6 | 28-50 | 26-52 | 24-54 | 23-55 | - |
| 2 | 7 | 3-17 | - | - | - | - |
| 3 | 7 | 8-25 | 7-26 | 6-27 | - | - |
| 4 | 7 | 14-34 | 13-35 | 11-37 | 10-38 | - |
| 5 | 7 | 21-44 | 20-45 | 18-47 | 16-49 | - |
| 6 | 7 | 29-55 | 27-57 | 25-59 | 24-60 | - |
| 7 | 7 | 39-66 | 36-69 | 34-71 | 32-73 | 28-77 |
| 2 | 8 | 4-18 | 3-19 | - | - | - |
| 3 | 8 | 9-27 | 8-28 | 6-30 | - | - |
| 4 | 8 | 15-37 | 14-38 | 12-40 | 11-41 | - |
| 5 | 8 | 23-47 | 21-49 | 19-51 | 17-53 | - |
| 6 | 8 | 31-59 | 29-61 | 27-63 | 25-65 | 21-69 |
| 7 | 8 | 41-71 | 38-74 | 35-77 | 34-78 | 29-83 |
| 8 | 8 | 51-85 | 49-87 | 45-91 | 43-93 | 38-98 |
| 2 | 9 | 4-20 | 3-21 | - | - | - |
| 3 | 9 | 10-29 | 8-31 | 7-32 | 6-33 | - |
| 4 | 9 | 16-40 | 14-42 | 13-43 | 11-45 | - |
| 5 | 9 | 24-51 | 22-53 | 20-55 | 18-57 | 15-60 |
| 6 | 9 | 33-63 | 31-65 | 28-68 | 26-70 | 22-74 |
| 7 | 9 | 43-76 | 40-79 | 37-82 | 35-84 | 30-89 |
| 8 | 9 | 54-90 | 51-93 | 47-97 | 45-99 | 40-104 |
| 9 | 9 | 66-105 | 62-109 | 59-112 | 56-115 | 50-121 |
| 2 | 10 | 4-22 | 3-23 | - | - | - |
| 3 | 10 | 10-32 | 9-33 | 7-35 | 6-36 | - |
| 4 | 10 | 17-43 | 15-45 | 13-47 | 12-48 | - |
| 5 | 10 | 26-54 | 23-57 | 21-59 | 19-61 | 15-65 |
| 6 | 10 | 35-67 | 32-70 | 29-73 | 27-75 | 23-79 |
| 7 | 10 | 45-81 | 42-84 | 39-87 | 37-89 | 31-95 |
| 8 | 10 | 56-96 | 53-99 | 49-103 | 47-105 | 41-111 |
| 9 | 10 | 69-111 | 65-115 | 61-119 | 58-122 | 52-128 |
| 10 | 10 | 82-128 | 78-132 | 74-136 | 71-139 | 63-147 |
| 2 | 11 | 4-24 | 3-25 | - | - | - |
| 3 | 11 | 11-34 | 9-36 | 7-38 | 6-39 | - |
| 4 | 11 | 18-46 | 16-48 | 14-50 | 12-52 | - |
| 5 | 11 | 27-58 | 24-61 | 22-63 | 20-65 | 16-69 |
| 6 | 11 | 37-71 | 34-74 | 30-78 | 28-80 | 23-85 |
| 7 | 11 | 47-86 | 44-89 | 40-93 | 38-95 | 32-101 |
| 8 | 11 | 59-101 | 55-105 | 51-109 | 49-111 | 42-118 |
| 9 | 11 | 72-117 | 68-121 | 63-126 | 61-128 | 53-136 |
| 10 | 11 | 86-134 | 81-139 | 77-143 | 73-147 | 65-155 |
| 11 | 11 | 100-153 | 96-157 | 91-162 | 87-166 | 78-175 |
| 2 | 12 | 5-25 | 4-26 | - | - | - |
| 3 | 12 | 11-37 | 10-38 | 8-40 | 7-41 | - |
| 4 | 12 | 19-49 | 17-51 | 15-53 | 13-55 | - |
| 5 | 12 | 28-62 | 26-64 | 23-67 | 21-69 | 16-74 |
| 6 | 12 | 38-76 | 35-79 | 32-82 | 30-84 | 24-90 |
| 7 | 12 | 49-91 | 46-94 | 42-98 | 40-100 | 33-107 |
| 8 | 12 | 62-106 | 58-110 | 53-115 | 51-117 | 43-125 |
| 9 | 12 | 75-123 | 71-127 | 66-132 | 63-135 | 55-143 |
| 10 | 12 | 89-141 | 84-146 | 79-151 | 76-154 | 67-163 |
| 11 | 12 | 104-160 | 99-165 | 94-170 | 90-174 | 81-183 |
| 12 | 12 | 120-180 | 115-185 | 109-191 | 105-195 | 95-205 |
| 2 | 13 | 5-27 | 4-28 | 3-29 | - | - |
| 3 | 13 | 12-39 | 10-41 | 8-43 | 7-44 | - |
| 4 | 13 | 20-52 | 18-54 | 15-57 | 13-59 | 10-62 |
| 5 | 13 | 30-65 | 27-68 | 24-71 | 22-73 | 17-78 |
| 6 | 13 | 40-80 | 37-83 | 33-87 | 31-89 | 25-95 |
| 7 | 13 | 52-95 | 48-99 | 44-103 | 41-106 | 34-113 |
| 8 | 13 | 64-112 | 60-116 | 56-120 | 53-123 | 45-131 |
| 9 | 13 | 78-129 | 73-134 | 68-139 | 65-142 | 56-151 |
| 10 | 13 | 92-148 | 88-152 | 82-158 | 79-161 | 69-171 |
| 11 | 13 | 108-167 | 103-172 | 97-178 | 93-182 | 83-192 |
| 12 | 13 | 125-187 | 119-193 | 113-199 | 109-203 | 98-214 |
| 13 | 13 | 142-209 | 136-215 | 130-221 | 125-226 | 114-237 |
| 2 | 14 | 6-28 | 4-30 | 3-31 | - | - |
| 3 | 14 | 13-41 | 11-43 | 8-46 | 7-47 | - |
| 4 | 14 | 21-55 | 19-57 | 16-60 | 14-62 | 10-66 |
| 5 | 14 | 31-69 | 28-72 | 25-75 | 22-78 | 17-83 |
| 6 | 14 | 42-84 | 38-88 | 34-92 | 32-94 | 26-100 |
| 7 | 14 | 54-100 | 50-104 | 45-109 | 43-111 | 35-119 |
| 8 | 14 | 67-117 | 62-122 | 58-126 | 54-130 | 46-138 |
| 9 | 14 | 81-135 | 76-140 | 71-145 | 67-149 | 58-158 |
| 10 | 14 | 96-154 | 91-159 | 85-165 | 81-169 | 71-179 |
| 11 | 14 | 112-174 | 106-180 | 100-186 | 96-190 | 85-201 |
| 12 | 14 | 129-195 | 123-201 | 116-208 | 112-212 | 100-224 |
| 13 | 14 | 147-217 | 141-223 | 134-230 | 129-235 | 116-248 |
| 14 | 14 | 166-240 | 160-246 | 152-254 | 147-259 | 134-272 |
| 2 | 15 | 6-30 | 4-32 | 3-33 | - | - |
| 3 | 15 | 13-44 | 11-46 | 9-48 | 8-49 | - |
| 4 | 15 | 22-58 | 20-60 | 17-63 | 15-65 | 10-70 |
| 5 | 15 | 33-72 | 29-76 | 26-79 | 23-82 | 18-87 |
| 6 | 15 | 44-88 | 40-92 | 36-96 | 33-99 | 26-106 |
| 7 | 15 | 56-105 | 52-109 | 47-114 | 44-117 | 36-125 |
| 8 | 15 | 69-123 | 65-127 | 60-132 | 56-136 | 47-145 |
| 9 | 15 | 84-141 | 79-146 | 73-152 | 69-156 | 60-165 |
| 10 | 15 | 99-161 | 94-166 | 88-172 | 84-176 | 73-187 |
| 11 | 15 | 116-181 | 110-187 | 103-194 | 99-198 | 87-210 |
| 12 | 15 | 133-203 | 127-209 | 120-216 | 115-221 | 103-233 |
| 13 | 15 | 152-225 | 145-232 | 138-239 | 133-244 | 119-258 |
| 14 | 15 | 171-249 | 164-256 | 156-264 | 151-269 | 137-283 |
| 15 | 15 | 192-273 | 184-281 | 176-289 | 171-294 | 156-309 |

For larger samples, the sum of the ranks in the smaller group, $T$, has an approximately Normal distribution with mean $\mu_T = n_1(n_1 + n_2 + 1)/2$ and standard deviation $\sigma_T = \sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$.

The test statistic $z = (T - \mu_T)/\sigma_T$ is evaluated using Table B2.
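For group sizes beyond the table, the Normal approximation can be computed directly. The sketch below is illustrative and rests on stated assumptions: the function name is hypothetical, $n_1$ denotes the smaller group, the mean $n_1(n_1+n_2+1)/2$ and standard deviation $\sqrt{n_1 n_2 (n_1+n_2+1)/12}$ are the usual large-sample values for the rank sum, and no correction for ties or continuity is made.

```python
import math
from typing import Tuple


def rank_sum_normal_approx(t_small: float, n1: int,
                           n2: int) -> Tuple[float, float, float, float]:
    """Return (mean, sd, z, two-tailed P) for the rank sum of the smaller
    group (size n1) compared with a group of size n2."""
    if n1 > n2:
        raise ValueError("n1 must be the size of the smaller group")
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (t_small - mean) / sd
    # Two-tailed Normal probability, 2 * (1 - Phi(|z|)).
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return mean, sd, z, p_two_tailed


# Illustrative values only: groups of 16 and 20, smaller-group rank sum 230.
mean, sd, z, p = rank_sum_normal_approx(230, 16, 20)
print(f"mean = {mean:.1f}, SD = {sd:.2f}, z = {z:.2f}, P = {p:.3f}")
```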

Table B11 Ranks for obtaining a confidence interval for the median

The tabulated values are ranks of the observations that provide approximate 90%, 95% or 99% confidence intervals for the population median based on a single sample of data.

Example: The 99% confidence interval for the population median calculated from a sample of size 56 is given by the observations with ranks 18 and 39.

| Sample size | 90% | 95% | 99% |
|---|---|---|---|
| 6 | 1, 6 | 1, 6 | - |
| 7 | 1, 7 | 1, 7 | - |
| 8 | 2, 7 | 1, 8 | 1, 8 |
| 9 | 2, 8 | 2, 8 | 1, 9 |
| 10 | 2, 9 | 2, 9 | 1, 10 |
| 11 | 3, 9 | 2, 10 | 1, 11 |
| 12 | 3, 10 | 3, 10 | 2, 11 |
| 13 | 4, 10 | 3, 11 | 2, 12 |
| 14 | 4, 11 | 3, 12 | 2, 13 |
| 15 | 4, 12 | 4, 12 | 3, 13 |
| 16 | 5, 12 | 4, 13 | 3, 14 |
| 17 | 5, 13 | 5, 13 | 3, 15 |
| 18 | 6, 13 | 5, 14 | 4, 15 |
| 19 | 6, 14 | 5, 15 | 4, 16 |
| 20 | 6, 15 | 6, 15 | 4, 17 |
| 21 | 7, 15 | 6, 16 | 5, 17 |
| 22 | 7, 16 | 6, 17 | 5, 18 |
| 23 | 8, 16 | 7, 17 | 5, 19 |
| 24 | 8, 17 | 7, 18 | 6, 19 |
| 25 | 8, 18 | 8, 18 | 6, 20 |
| 26 | 9, 18 | 8, 19 | 7, 20 |
| 27 | 9, 19 | 8, 20 | 7, 21 |
| 28 | 10, 19 | 9, 20 | 7, 22 |
| 29 | 10, 20 | 9, 21 | 8, 22 |
| 30 | 11, 20 | 10, 21 | 8, 23 |
| 31 | 11, 21 | 10, 22 | 8, 24 |
| 32 | 11, 22 | 10, 23 | 9, 24 |
| 33 | 12, 22 | 11, 23 | 9, 25 |
| 34 | 12, 23 | 11, 24 | 10, 25 |
| 35 | 13, 23 | 12, 24 | 10, 26 |
| 36 | 13, 24 | 12, 25 | 10, 27 |
| 37 | 14, 24 | 13, 25 | 11, 27 |
| 38 | 14, 25 | 13, 26 | 11, 28 |
| 39 | 14, 26 | 13, 27 | 12, 28 |
| 40 | 15, 26 | 14, 27 | 12, 29 |
| 41 | 15, 27 | 14, 28 | 12, 30 |
| 42 | 16, 27 | 15, 28 | 13, 30 |
| 43 | 16, 28 | 15, 29 | 13, 31 |
| 44 | 17, 28 | 16, 29 | 14, 31 |
| 45 | 17, 29 | 16, 30 | 14, 32 |
| 46 | 17, 30 | 16, 31 | 14, 33 |
| 47 | 18, 30 | 17, 31 | 15, 33 |
| 48 | 18, 31 | 17, 32 | 15, 34 |
| 49 | 19, 31 | 18, 32 | 16, 34 |
| 50 | 19, 32 | 18, 33 | 16, 35 |
| 51 | 20, 32 | 19, 33 | 16, 36 |
| 52 | 20, 33 | 19, 34 | 17, 36 |
| 53 | 21, 33 | 19, 35 | 17, 37 |
| 54 | 21, 34 | 20, 35 | 18, 37 |
| 55 | 21, 35 | 20, 36 | 18, 38 |
| 56 | 22, 35 | 21, 36 | 18, 39 |
| 57 | 22, 36 | 21, 37 | 19, 39 |
| 58 | 23, 36 | 22, 37 | 19, 40 |
| 59 | 23, 37 | 22, 38 | 20, 40 |
| 60 | 24, 37 | 22, 39 | 20, 41 |
| 61 | 24, 38 | 23, 39 | 21, 41 |
| 62 | 25, 38 | 23, 40 | 21, 42 |
| 63 | 25, 39 | 24, 40 | 21, 43 |
| 64 | 25, 40 | 24, 41 | 22, 43 |
| 65 | 26, 40 | 25, 41 | 22, 44 |
| 66 | 26, 41 | 25, 42 | 23, 44 |
| 67 | 27, 41 | 26, 42 | 23, 45 |
| 68 | 27, 42 | 26, 43 | 23, 46 |
| 69 | 28, 42 | 26, 44 | 24, 46 |
| 70 | 28, 43 | 27, 44 | 24, 47 |
| 71 | 29, 43 | 27, 45 | 25, 47 |
| 72 | 29, 44 | 28, 45 | 25, 48 |
| 73 | 29, 45 | 28, 46 | 26, 48 |
| 74 | 30, 45 | 29, 46 | 26, 49 |
| 75 | 30, 46 | 29, 47 | 26, 50 |
| 76 | 31, 46 | 29, 48 | 27, 50 |
| 77 | 31, 47 | 30, 48 | 27, 51 |
| 78 | 32, 47 | 30, 49 | 28, 51 |
| 79 | 32, 48 | 31, 49 | 28, 52 |
| 80 | 33, 48 | 31, 50 | 29, 52 |
| 81 | 33, 49 | 32, 50 | 29, 53 |
| 82 | 34, 49 | 32, 51 | 29, 54 |
| 83 | 34, 50 | 33, 51 | 30, 54 |
| 84 | 34, 51 | 33, 52 | 30, 55 |
| 85 | 35, 51 | 33, 53 | 31, 55 |
| 86 | 35, 52 | 34, 53 | 31, 56 |
| 87 | 36, 52 | 34, 54 | 32, 56 |
| 88 | 36, 53 | 35, 54 | 32, 57 |
| 89 | 37, 53 | 35, 55 | 32, 58 |
| 90 | 37, 54 | 36, 55 | 33, 58 |
| 91 | 38, 54 | 36, 56 | 33, 59 |
| 92 | 38, 55 | 37, 56 | 34, 59 |
| 93 | 39, 55 | 37, 57 | 34, 60 |
| 94 | 39, 56 | 38, 57 | 35, 60 |
| 95 | 39, 57 | 38, 58 | 35, 61 |
| 96 | 40, 57 | 38, 59 | 35, 62 |
| 97 | 40, 58 | 39, 59 | 36, 62 |
| 98 | 41, 58 | 39, 60 | 36, 63 |
| 99 | 41, 59 | 40, 60 | 37, 63 |
| 100 | 42, 59 | 40, 61 | 37, 64 |

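For samples larger than 100, the required ranks are the nearest integers to

$$\frac{n}{2} - z_{1-\alpha/2}\,\frac{\sqrt{n}}{2} \quad\text{and}\quad 1 + \frac{n}{2} + z_{1-\alpha/2}\,\frac{\sqrt{n}}{2},$$

where $\alpha = 0.1$, 0.05 or 0.01. Values of $z_{1-\alpha/2}$ are given in Table B3.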
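The large-sample ranks can be computed rather than read from a table. The sketch below is an illustration under stated assumptions: the function name `median_ci_ranks` is hypothetical, the ranks are taken as the nearest integers to $n/2 - z\sqrt{n}/2$ and $1 + n/2 + z\sqrt{n}/2$ with $z$ the Normal quantile for the chosen confidence level, and `statistics.NormalDist` stands in for a printed table of Normal quantiles. With $n = 100$ it reproduces the last row of Table B11.

```python
import math
from statistics import NormalDist
from typing import Tuple


def median_ci_ranks(n: int, confidence: float = 0.95) -> Tuple[int, int]:
    """Return the (lower, upper) ranks whose observations bound an
    approximate confidence interval for the population median."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # e.g. about 1.96 for 95%
    half_width = z * math.sqrt(n) / 2.0
    lower = round(n / 2.0 - half_width)
    upper = round(1.0 + n / 2.0 + half_width)
    return lower, upper


for level in (0.90, 0.95, 0.99):
    print(level, median_ci_ranks(100, level))
```

The observations with these ranks, counted from the smallest value in the sorted sample, give the limits of the interval.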

Table B12 The Shapiro-Francia test of non-Normality

The $P$ value is the smallest $P$ for which the tabulated value exceeds the observed value of $W'$, i.e. small values of $W'$ indicate non-Normality.

示例:一个样本量为18的观察值 ,对应
Example: An observed value of from a sample of 18 gives .

| Sample size | P = 0.2 | P = 0.1 | P = 0.05 | P = 0.02 | P = 0.01 | P = 0.001 |
|---|---|---|---|---|---|---|
| 10 | 0.9010 | 0.8728 | 0.8445 | 0.8063 | 0.7765 | 0.6710 |
| 11 | 0.9068 | 0.8804 | 0.8537 | 0.8174 | 0.7890 | 0.6872 |
| 12 | 0.9120 | 0.8871 | 0.8618 | 0.8273 | 0.8001 | 0.7021 |
| 13 | 0.9166 | 0.8930 | 0.8690 | 0.8361 | 0.8101 | 0.7156 |
| 14 | 0.9208 | 0.8984 | 0.8755 | 0.8441 | 0.8192 | 0.7281 |
| 15 | 0.9245 | 0.9032 | 0.8814 | 0.8514 | 0.8275 | 0.7395 |
| 16 | 0.9279 | 0.9076 | 0.8868 | 0.8580 | 0.8350 | 0.7501 |
| 17 | 0.9309 | 0.9115 | 0.8916 | 0.8640 | 0.8419 | 0.7599 |
| 18 | 0.9337 | 0.9152 | 0.8961 | 0.8695 | 0.8483 | 0.7690 |
| 19 | 0.9363 | 0.9185 | 0.9001 | 0.8746 | 0.8541 | 0.7774 |
| 20 | 0.9387 | 0.9216 | 0.9039 | 0.8793 | 0.8595 | 0.7853 |
| 21 | 0.9409 | 0.9244 | 0.9074 | 0.8837 | 0.8646 | 0.7927 |
| 22 | 0.9429 | 0.9270 | 0.9106 | 0.8877 | 0.8692 | 0.7997 |
| 23 | 0.9448 | 0.9295 | 0.9136 | 0.8915 | 0.8736 | 0.8061 |
| 24 | 0.9465 | 0.9317 | 0.9164 | 0.8950 | 0.8777 | 0.8123 |
| 25 | 0.9482 | 0.9339 | 0.9190 | 0.8983 | 0.8815 | 0.8180 |
| 30 | 0.9550 | 0.9427 | 0.9299 | 0.9121 | 0.8976 | 0.8424 |
| 35 | 0.9601 | 0.9494 | 0.9382 | 0.9225 | 0.9098 | 0.8612 |
| 40 | 0.9642 | 0.9546 | 0.9446 | 0.9307 | 0.9194 | 0.8761 |
| 45 | 0.9674 | 0.9588 | 0.9498 | 0.9373 | 0.9271 | 0.8882 |
| 50 | 0.9701 | 0.9622 | 0.9541 | 0.9427 | 0.9335 | 0.8982 |
| 55 | 0.9723 | 0.9651 | 0.9577 | 0.9472 | 0.9388 | 0.9066 |
| 60 | 0.9742 | 0.9676 | 0.9607 | 0.9511 | 0.9433 | 0.9137 |
| 65 | 0.9759 | 0.9697 | 0.9633 | 0.9544 | 0.9472 | 0.9199 |
| 70 | 0.9773 | 0.9716 | 0.9656 | 0.9573 | 0.9506 | 0.9252 |
| 75 | 0.9786 | 0.9732 | 0.9676 | 0.9598 | 0.9536 | 0.9299 |
| 80 | 0.9797 | 0.9746 | 0.9694 | 0.9621 | 0.9562 | 0.9340 |
| 85 | 0.9807 | 0.9759 | 0.9710 | 0.9641 | 0.9585 | 0.9376 |
| 90 | 0.9816 | 0.9771 | 0.9724 | 0.9659 | 0.9606 | 0.9409 |
| 95 | 0.9825 | 0.9781 | 0.9737 | 0.9675 | 0.9625 | 0.9439 |
| 100 | 0.9832 | 0.9791 | 0.9748 | 0.9689 | 0.9642 | 0.9465 |
| 110 | 0.9845 | 0.9807 | 0.9768 | 0.9715 | 0.9672 | 0.9511 |
| 120 | 0.9856 | 0.9821 | 0.9786 | 0.9736 | 0.9697 | 0.9550 |
| 130 | 0.9866 | 0.9833 | 0.9800 | 0.9755 | 0.9718 | 0.9583 |
| 140 | 0.9874 | 0.9844 | 0.9813 | 0.9771 | 0.9737 | 0.9612 |
| 150 | 0.9881 | 0.9853 | 0.9824 | 0.9785 | 0.9753 | 0.9637 |
| 160 | 0.9888 | 0.9861 | 0.9834 | 0.9797 | 0.9767 | 0.9658 |
| 170 | 0.9893 | 0.9868 | 0.9843 | 0.9808 | 0.9780 | 0.9677 |
| 180 | 0.9899 | 0.9875 | 0.9851 | 0.9817 | 0.9791 | 0.9695 |
| 190 | 0.9903 | 0.9881 | 0.9858 | 0.9826 | 0.9801 | 0.9710 |
| 200 | 0.9907 | 0.9886 | 0.9864 | 0.9834 | 0.9811 | 0.9724 |
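To use the table, the Shapiro-Francia statistic must first be calculated from the data. The sketch below is one way of doing so and is not taken from the book: the function name is hypothetical, and the expected Normal order statistics are approximated by Blom's scores, $\Phi^{-1}\{(i - 3/8)/(n + 1/4)\}$, which is an assumption made here for illustration. The statistic is the squared correlation between the ordered observations and these scores.

```python
from statistics import NormalDist, mean
from typing import Sequence


def shapiro_francia_w(data: Sequence[float]) -> float:
    """Squared correlation between the sorted data and approximate
    expected Normal order statistics (Blom scores, an assumption)."""
    x = sorted(data)
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    nd = NormalDist()
    scores = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    x_bar, m_bar = mean(x), mean(scores)
    sxy = sum((xi - x_bar) * (mi - m_bar) for xi, mi in zip(x, scores))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    smm = sum((mi - m_bar) ** 2 for mi in scores)
    if sxx == 0:
        raise ValueError("all observations are identical")
    return sxy * sxy / (sxx * smm)


# Illustrative data only: one extreme value gives a small statistic.
print(round(shapiro_francia_w([1, 1, 1, 1, 1, 1, 1, 1, 1, 50]), 3))
```

The computed value is then compared with the row of Table B12 for the sample size; values below a tabulated entry correspond to a $P$ value smaller than that column's heading.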


49 93 74 94 78 47 17 30 59 07 57 53 58 58 93 61 89 30 60 61 35 87 87 20 95 97 85 46 34 21 47 43 34 14 62 54 06 48 31 17 65 33 06 11 86 63 76 4 83 30 10 70 73 79 63 13 20 95 35 33 10 73 14 81 48 00 29 23 28 52 69 80 70 29 54 79 72 67 29 08 56 84 87 99 32 99 80 22 50 96 34 17 62 24 27 69 02 52 76 77 27 27 14 43 07 50 51 48 86 37 48 21 1 08 84 67 39 53 16 54 33 96 04 67 56 50 83 05 11 40 98 45 17 48 93 75 66 25 65 46 96 04 65 56 50 50 83 05 11 40 98 47 17 48 93 75 66 53 35 83 05 11 40 98 45 17 48 93 75 66 53 35 83 05 11 40 98 45 17 48 93 75 66 53 35 83 05 11 40 98 45 17 48 93 75 66 53 35

附录 B 中表格的来源 Sources of Tables in Appendix B

B1 由 NAG* 程序计算得出。
B1 derived using NAG* routine.

B2 由 NAG* 程序计算得出。
B2 derived using NAG* routine.

B3 经 Lentner (1982) 许可改编。
B3 adapted from Lentner (1982) with permission.

B4 由 NAG* 程序计算得出。
B4 derived using NAG* routine.

B5 经 Fisher 和 Yates (1963) 许可改编。
B5 adapted from Fisher and Yates (1963) with permission.

B6 由 NAG* 程序计算得出。
B6 derived using NAG* routine.

B7 由 NAG* 程序计算得出。
B7 derived using NAG* routine.

B8 来源于 Glasser 和 Winter (1961),经修正,获 Biometrika 受托人许可。
B8 from Glasser and Winter (1961) with corrections, with permission of Biometrika trustees.

B9 经 Lentner (1982) 许可改编。
B9 adapted from Lentner (1982) with permission.

B10 经 Lentner (1982) 许可改编。
B10 adapted from Lentner (1982) with permission.

B11 经 Gardner 和 Altman (1989c,第119页) 许可使用。
B11 from Gardner and Altman (1989c, p. 119) with permission.

B12 直接取自 Royston (1983) 的公式。
B12 directly from formulae in Royston (1983).

B13 使用 STATA 中的随机数生成器获得。
B13 obtained using random number generator in STATA.

习题答案 Answers to exercises

这些解答通常不包括作为分析常规部分通常会生成的图表。
These solutions do not in general include the graphs that would usually be produced routinely as part of the analysis.

常用缩写有:
Common abbreviations are:

SD 标准差
SD standard deviation

SE 标准误
SE standard error

CI 置信区间
CI confidence interval

df 自由度
df degrees of freedom

第3章 CHAPTER 3

3.1 (a) 截尾数据。
3.1 (a) Censored.

(b) 我们不知道SI的上限。未截尾的数据表明分布呈正偏态(偏右)。
(b) We do not know the upper limit of SI. The uncensored data show that the distributions are positively skewed (skewed to the right).

(c) 数据呈偏态分布,且部分值被删失。
(c) The data are skewed and some values are censored.

(d) 分别为3.8和22.3。
(d) 3.8 and 22.3 respectively.

(e)
(e)

(f)

| 茎 Stem | 无不良反应 Without adverse reactions | 有不良反应 With adverse reactions |
|---|---|---|
| 2 | | 99 |
| 3 | 135799 | 89 |
| 4 | 1234489 | 1244669999 |
| 5 | 1337899 | 01113333347799 |
| 6 | 11457 | 12237788 |
| 7 | 122 | 4 |

(g) 两组的中位年龄可从茎叶图中轻松得出,分别为52岁和53岁,表明不良反应与年龄无关。
(g) The median ages of the two groups are easily obtained from the stem-and-leaf diagrams as 52 and 53, suggesting that adverse reactions are not age-related.

3.2 (a) 答案取决于“更可能”的解释。图表显示涉及职业飞行员的事故多于其他任何群体,但未提供个体风险的信息。
3.2 (a) The answer depends upon the interpretation of 'more likely'. The Figure shows that more accidents involve professional pilots than any other group, but gives no information of the risk per individual.

(b) 当根据各组飞行量调整数据后,显然职业飞行员的风险最低。风险最高的是家庭主妇和学生。不同的答案可由飞行量与事故风险之间的强负相关关系解释。
(b) When the figures are adjusted for the amount of flying in each group it is clear that professional pilots had much the lowest risk. The highest risks were among housewives and students. The different answers are explained by the strong negative relation between the amount of flying and the risk of an accident.

3.3 图3.12显示了第2.5、25、50、75和97.5百分位数。所需观察值的秩次为样本量加1(299)乘以0.025、0.25、0.5、0.75和0.975,即7.475、74.75、149.5、224.25和291.525。可从表3.4中找到这些秩次两侧的观察值(计算累计频数列有助于查找),并通过插值法获得所需的百分位数。然而,在每种情况下,所需秩次两侧的观察值相同,因此IgM分布的百分位数简单地为0.2、0.5、0.7、1.0 和

3.3 Figure 3.12 shows the 2.5, 25, 50, 75 and 97.5 centiles. The ranks of the required observations are thus the sample size plus 1 (299) multiplied by 0.025, 0.25, 0.5, 0.75 and 0.975. These values are 7.475, 74.75, 149.5, 224.25 and 291.525. The observations with ranks either side of these values can be found from Table 3.4 (it helps to calculate a column of cumulative frequencies), and interpolation used to get the required centiles. However, in each case the observations with ranks either side of the required rank are the same, so the centiles of the distribution of IgM are obtained simply as 0.2, 0.5, 0.7, 1.0 and .
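
The rank arithmetic in 3.3 can be checked with a short sketch. The function name `centile_ranks` is illustrative only, and the sample size of 298 is inferred from the stated "sample size plus 1 (299)".

```python
def centile_ranks(n, fractions):
    """Return the rank (n + 1) * q of the observation at each centile fraction q."""
    return [(n + 1) * q for q in fractions]


fractions = [0.025, 0.25, 0.5, 0.75, 0.975]
ranks = centile_ranks(298, fractions)  # n = 298 is inferred, not quoted directly

for q, r in zip(fractions, ranks):
    print(f"{100 * q:g}th centile -> rank {r:.3f}")
```

A non-integer rank such as 7.475 means the centile lies between the 7th and 8th ordered observations, which is why interpolation is mentioned in the answer.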

第4章 CHAPTER 4

4.1 0.023 或 2.3%(来自表 B1)。

4.1 0.023 or 2.3% (from Table B1).

4.2 按照第 4.9.1 节给出的方法,我们需要计算获得 0、1 或 2 品脱 B 型血的概率。因此,我们需要计算从 100 个样本中选出这些数量的方法数,即:
4.2 Following the method given in section 4.9.1, we need to calculate the probabilities of obtaining 0, 1, or 2 pints of group B blood. We thus need to calculate the number of ways of choosing these numbers from a sample of 100, which are:

使用第 4.9.1 节中的公式,所需概率为
Using the formula in section 4.9.1, the required probability is

获得少于三品脱 B 型血的概率约为 1/100,即 0.01。
The probability of getting fewer than three pints of group B blood is about 1 in 100, or 0.01.
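
A minimal sketch of the calculation in 4.2 follows. The proportion of group B blood is not stated in this answer, so the value `p_group_b = 0.08` is an assumption made for illustration; it is used here because it gives a result close to the quoted 1 in 100.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100
p_group_b = 0.08  # ASSUMED proportion of group B blood; not given in this answer

# Number of ways of choosing 0, 1 or 2 pints from a sample of 100
ways = [comb(n, k) for k in range(3)]

# Probability of fewer than three pints of group B blood
prob_fewer_than_three = sum(binomial_pmf(k, n, p_group_b) for k in range(3))

print(ways)
print(round(prob_fewer_than_three, 4))
```

With a different assumed proportion the numbers of ways are unchanged, but the final probability changes.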

4.3 由于男孩的概率略高于女孩,男孩最多的序列最可能出现,即最后一个序列。其他两个序列出现的概率相等。
4.3 As the probability of a boy is slightly greater than the probability of a girl, the sequence with the most boys is most likely, which is the last sequence. The other two sequences are equally likely.

4.4 (a) 0.0013。
4.4 (a) 0.0013.

(b)
(b)

4.5 (a) 这个问题可以反过来考虑,即我们希望评估必要的样本量,使得所有儿童检测结果均为阴性的概率小于 0.05。数学上,我们需要样本量 ,使得 。通过反复试验可知,。(如果数学功底扎实,你可能会计算出 是大于 的最小整数。)
4.5 (a) The question can be reversed, so that we wish to evaluate the necessary sample size so that the probability of all the children being negative to the test is . In mathematics, we require the sample size , such that . It is simple to show, by trial and error, that we need . (If your maths is good, you might have evaluated as the smallest integer greater than .)

(b) 它没有影响。
(b) It has no effect.
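
The sample size argument in 4.5 (a) can be sketched as follows. The probability of a positive test is not legible in this answer, so the values passed to the functions below are hypothetical; both function names are illustrative.

```python
from math import floor, log


def min_sample_size_trial(p_positive, alpha=0.05):
    """Trial and error: smallest n with (1 - p_positive) ** n < alpha."""
    n = 1
    while (1 - p_positive) ** n >= alpha:
        n += 1
    return n


def min_sample_size_formula(p_positive, alpha=0.05):
    """Smallest integer greater than log(alpha) / log(1 - p_positive)."""
    return floor(log(alpha) / log(1 - p_positive)) + 1


# Hypothetical probabilities of a positive test, for illustration only
for p in (0.1, 0.5):
    print(p, min_sample_size_trial(p), min_sample_size_formula(p))
```

The two functions agree except when the ratio of logarithms is exactly an integer, a boundary case where floating-point rounding could matter.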

4.6 一开始的最低身高距离均值 -0.65 个标准差。从表B1中,身高超过该值的男性比例为0.7422(约74%)。在25年期末,最低身高距离均值 -1.22 个标准差。从表B1中,身高高于最低值的男性比例现在为0.8888,约89%。不合格男性的比例减少了一半以上。

4.6 At the start the minimum height was -0.65 standard deviations from the mean. From Table B1, the proportion of men taller than this would be 0.7422 (about 74%). At the end of the 25 year period the minimum height was -1.22 standard deviations from the mean. From Table B1, the proportion of men above the minimum is now 0.8888, or about 89%. The proportion of ineligible men has more than halved.
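
The standard Normal deviates implied by the two quoted proportions can be recovered by using the Normal distribution in reverse. This is a sketch with the Python standard library rather than Table B1; the helper name `z_for_upper_tail` is illustrative.

```python
from statistics import NormalDist


def z_for_upper_tail(proportion_above):
    """Standard Normal deviate z such that P(Z > z) equals proportion_above."""
    return NormalDist().inv_cdf(1 - proportion_above)


z_start = z_for_upper_tail(0.7422)  # minimum height at the start
z_end = z_for_upper_tail(0.8888)    # minimum height after 25 years

ineligible_start = 1 - 0.7422
ineligible_end = 1 - 0.8888

print(round(z_start, 2), round(z_end, 2))
print(round(ineligible_start, 4), round(ineligible_end, 4))
```

The ineligible proportions are the complements of the quoted values, which is how the "more than halved" statement can be verified.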

4.7 如果真实血压没有变化,三次测量中任意一次位于另外两次之间的概率相等。第三次测量不在前两次之间的概率为三分之二(0.67),不考虑测量值相等的可能性。没有理由期望第三次测量一定在前两次之间,也没有理由因为不在其中就认为测量不可靠。如果分析中使用三次读数的平均值,只有两次测量者的平均值估计会较差。
4.7 If there is no change in the true blood pressure, each of a sequence of three measurements is equally likely to be in between the other two. The probability that the third measurement will not fall between the first two is thus two thirds (0.67), discounting the possibility of equal measurements. There is no reason to expect the third measurement to be between the first two and no reason to discard measurements as unreliable if they do not. If the intention is to use the average of the three readings in an analysis, the averages would be less well estimated for those with only two values.
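
The two thirds in 4.7 can be confirmed by enumerating the six equally likely orderings of three distinct measurements. This sketch assumes no ties, as the answer does.

```python
from fractions import Fraction
from itertools import permutations

# Each tuple gives the ranks of the (first, second, third) measurement
orderings = list(permutations([1, 2, 3]))

# Orderings in which the third measurement is NOT between the first two
outside = [
    o for o in orderings
    if not (min(o[0], o[1]) < o[2] < max(o[0], o[1]))
]

prob_outside = Fraction(len(outside), len(orderings))
print(prob_outside)
```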

4.8 (a) 每个孩子未受影响的概率是0.75。两个孩子均未受影响的概率是 0.75 × 0.75 = 0.5625,因为这些事件相互独立。

4.8 (a) The probability of each child being unaffected is 0.75. The probability of two children being unaffected is 0.75 × 0.75 = 0.5625, as these are independent events.

(b) 概率是0.75。每个孩子的概率相同,与前面孩子的结果无关。
(b) The probability is 0.75. Each child has the same probability, regardless of the outcome for previous children.

(c) 双方父母均为异常基因杂合子的概率是。因此,每年预期患囊性纤维化的婴儿数为,约为2个。
(c) The probability of both parents being heterozygous for the abnormal gene is . The expected number of babies with cystic fibrosis per year is thus , which is about 2.

第5章 CHAPTER 5

5.1 (a) 不,这绝对是错误的。如果我们想观察一组受试者行为是否发生变化,应重新检查整组。只研究选定的子集会导致重新检查的样本有偏,从而使结果有偏。
5.1 (a) No, it is definitely wrong. If we want to see if a group of subjects has shifted its behaviour in some way we should re- examine the whole group. To study only a selected subset is to bias the sample being re- examined and thus to bias the results.

(b) 第二次研究的响应率为69%,相当低。未响应者占近三分之一,可能具有非典型特征(这通常是事实),可能包括饮酒量较大者。例如,有些人可能因饮酒相关疾病而病重无法响应。因此,高未响应率可能导致结果偏倚。作者应比较首次调查中响应者与未响应者的特征。论文未给出首次调查的响应率。如果首次调查响应率也约为70%,这很可能,那么第二次调查的最终样本将更加高度选择性。
(b) The response rate in the second study was 69%, which is rather poor. The non-responders, nearly a third of the sample, might well be atypical (this is what is usually found), and could well include heavier drinkers. For example, some may have been too ill to respond through a drink-related illness. Thus the high rate of non-response could have biased the results. The authors should have compared the characteristics of responders and non-responders with respect to their responses to the first survey. The paper does not give the response rate to the first study. If it was also around 70%, which is quite likely, then the final sample interviewed at the second survey would be even more highly selected.

(c) 这不是好主意,因为饮酒习惯全年并不一致。
(c) It is not a good idea, because drinking habits are not consistent throughout the year.

(d) 不。首先,我们不能简单地将两个同时发生的时间变化解释为因果关系。其次,我们会预期他们发现报告的酒精消费量有所减少,因为样本被偏向只包括第一项研究中饮酒的人。如果他们只重新采访第一项研究中不饮酒的人,我们会预期发现饮酒量有所增加。这是所谓“回归均值”现象的一种表现,发生在我们对先前某次测量中在某一限定范围内选取的样本重新测量同一变量时。因此,即使时间上没有变化,这项研究也可能显示出酒精消费的减少。

(d) No. First, we cannot necessarily interpret two simultaneous changes over time as being causally related. Second, we would expect them to have found a reduction in the reported alcohol consumption, because the sample was biased to include only those drinking in the first study. If they had re-interviewed only those not drinking in the first study, we would expect them to have found an increase. This is one form of the phenomenon known as 'regression to the mean', and occurs when we remeasure a quantity on a sample selected by a restricted range of the same quantity on a previous occasion. Thus even when there has been no change over time, this study would be expected to have shown a decrease in alcohol consumption.

(e) 不。基于上述理由,他们绝非具有代表性。此外,他们是样本,而非总体。
(e) No. For reasons given above, they are by no means representative. Further, they are a sample, not a population.

(f) 由于上述原因,该解释无效。
(f) The interpretation is not valid for the reasons noted above.

(g) 不。
(g) No.

5.2 (a) 这是一项横断面研究。
5.2 (a) It was a cross- sectional study.

(b) 如果目标人群是所有英国绝经期女性,那么该全科诊所的代表性就很重要。我们没有相关信息,尽管在单一诊所进行此类研究似乎并不不合理。所有1930年出生的132名女性均被调查,因此不存在选择偏倚。然而,我们假设患者登记册准确完整,且有21/132人无法联系,这对准确性提出了质疑。在这132名女性中,只有31人实际参与了研究,原因各异。大多数排除理由合理,但为何排除未婚女性尚不清楚。样本似乎相当具有代表性。
(b) If the population of interest is taken as all menopausal British women, then the representativeness of this general practice is relevant. We have no information about this, although it does not seem unreasonable to carry out this type of study in a single practice. All 132 women born in 1930 were investigated, so there was no selection bias. However, we are assuming that the register of patients is accurate and complete, and the fact that 21/132 were not contactable casts some doubt on this. Of these 132 women, only 31 were actually studied, for the various reasons stated. Most of the exclusions are reasonable, although it is not clear why the unmarried women were excluded. The sample appears reasonably representative.

(c) 该研究的主要问题是,如果服用避孕药会延迟绝经,那么在研究时,一些绝经被延迟的女性仍处于绝经前期,因此被排除在研究之外。该设计无法实现研究目标。
(c) The major problem with this study is that if the use of the pill delays the menopause then at the time of the study some women who will have had their menopause delayed will still have been premenopausal and so excluded from the study. The design does not allow the research objective to be investigated.

(d) 这可以通过队列研究来回答。例如,研究者可以选取所有1930年出生的女性,等待她们全部绝经后再进行比较,这样可以得出有效结论。然而,应注意,不应将所有避孕药使用者简单合并,无论其使用时长或使用年龄如何。队列设计允许探讨这些因素与绝经年龄的关系。
(d) This is a question that could be answered by a cohort study. If the researcher had taken, for example, all the women born in 1930 and waited until they had all reached the menopause, then he could make a valid comparison. However, it should be noted that it is not really good practice to lump together all pill users regardless of the length of pill use or the age at which it was taken. The cohort design would allow these factors to be investigated in relation to the age of menopause.

5.3 (a) 棒球运动员显然不能真正代表总体,但无法判断这在本例中是否重要。然而,左撇子比例可能较一般人群偏高(14%)。
5.3 (a) Baseball players are clearly not truly representative of the population, but it is impossible to assess whether it matters in this particular case. However, it appears that there might have been a higher proportion of left- handers (14%) than we would expect in the population.

(b) 如果左撇子的流行率在二十世纪有所增加,或者不同社会群体(具有不同死亡率)中的流行率发生了变化。这两种情况都是可能的。
(b) If the prevalence of left-handedness had increased during the twentieth century, or if the prevalence within different social groups (with different mortality rates) had changed. Both of these are likely.

(c) 分析会偏向包含出生较早的人,因为大多数最近的球员仍然健在。很可能在世纪初左撇子较少见。这两个事实意味着分析会偏向于左撇子较早死亡。此外,排除仍然健在者来分析平均死亡年龄是误导的。正确的生存数据分析方法(同时考虑幸存者数据)在第13章中有描述。
(c) The analysis would be biased towards including those born a long time ago as most recent players would still be alive. It is likely that left-handedness was less common earlier in the century than it is now. These two facts mean that the analysis would be biased towards earlier death among left-handers. Further, it is misleading to analyse mean age at death excluding those still alive. The correct analysis of survival data, which takes account of data for survivors too, is described in Chapter 13.

(d) 理想情况下,应选择一批出生时间相近的人群,例如同一年入学的学生。为了获得合理的死亡比例,幸存者年龄应至少达到70岁。这样的数据极不可能存在,且前瞻性研究耗时极长!如果采用适当方法分析棒球数据,并在分析中考虑出生年份,结果会更有效。
(d) It would be desirable to take a cohort of people born at about the same time, for example all those in the same year at school. In order to get a reasonable proportion of deaths, the survivors would need to be at least 70. It is most unlikely that such data are available and a prospective study would take a very long time! The baseball data would yield a more valid answer if analysed using an appropriate method, and if year of birth was considered in the analysis.

第7章 CHAPTER 7

7.1 图A7.1显示了数据的散点图。最偏离总体趋势的点位于左下角,对应患者19。该患者的楔压实际为
7.1 Figure A7.1 shows a scatter diagram of the data. The point that is most distant from the general trend is in the bottom left- hand corner, corresponding to patient 19. This patient's wedge pressure was actually .

7.2 下表显示了计算过程:
7.2 The following table shows the calculations:

| T4 | log_e T4 | P | 正态分数 Normal score |
|---|---|---|---|
| 171 | 5.14 | 0.0309 | -1.868 |
| 257 | 5.55 | 0.0802 | -1.403 |
| 288 | 5.66 | 0.1296 | -1.128 |
| 295 | 5.69 | 0.1790 | -0.919 |
| 396 | 5.98 | 0.2284 | -0.744 |
| 397 | 5.98 | 0.2778 | -0.589 |
| 431 | 6.07 | 0.3272 | -0.448 |
| 435 | 6.08 | 0.3765 | -0.315 |
| 554 | 6.32 | 0.4259 | -0.187 |
| 568 | 6.34 | 0.4753 | -0.062 |
| 795 | 6.68 | 0.5247 | 0.062 |
| 902 | 6.80 | 0.5741 | 0.187 |
| 958 | 6.86 | 0.6235 | 0.315 |
| 1004 | 6.91 | 0.6728 | 0.448 |
| 1104 | 7.01 | 0.7222 | 0.589 |
| 1212 | 7.10 | 0.7716 | 0.744 |
| 1283 | 7.16 | 0.8210 | 0.919 |
| 1378 | 7.23 | 0.8704 | 1.128 |
| 1621 | 7.39 | 0.9198 | 1.403 |
| 2415 | 7.79 | 0.9691 | 1.868 |

正态分数可以通过许多统计软件获得。否则,可通过反向使用表B1获得。这些数据的正态图非常直,如图A7.2所示。
The Normal scores can be obtained in many statistics packages. Otherwise they can be obtained by using Table B1 'in reverse'. The Normal plot of these data is very straight, as is shown in Figure A7.2.
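
The table in 7.2 can be reproduced with the standard library. The plotting position used below, (i - 3/8)/(n + 1/4), is an assumption inferred from the tabulated P values rather than a formula stated in this answer, and natural logarithms are assumed for the log column.

```python
from math import log
from statistics import NormalDist

t4 = [171, 257, 288, 295, 396, 397, 431, 435, 554, 568,
      795, 902, 958, 1004, 1104, 1212, 1283, 1378, 1621, 2415]

n = len(t4)
rows = []
for i, value in enumerate(sorted(t4), start=1):
    p = (i - 3 / 8) / (n + 1 / 4)    # ASSUMED plotting position for rank i
    score = NormalDist().inv_cdf(p)  # Normal score: the Normal table used "in reverse"
    rows.append((value, round(log(value), 2), round(p, 4), round(score, 3)))

for row in rows:
    print(row)
```

Plotting the log values against the Normal scores gives the Normal plot described in the answer.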

7.3 末位数字(剂量SA时为倒数第二位数字)的分布如下:
7.3 The terminal digits (or penultimate digits in the case of dose of SA) are distributed as follows:

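| 数字 Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 总计 Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 年龄 Age | 1 | 11 | 6 | 10 | 7 | 2 | 2 | 7 | 5 | 14 | 65 |
| SA | 22 | 4 | 0 | 6 | 0 | 10 | 20 | 0 | 0 | 3 | 65 |
| SI | 16 | 0 | 5 | 3 | 4 | 2 | 3 | 4 | 7 | 4 | 48 |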

数字偏好在测量中是一种可能出现的现象,但对于诸如年龄之类的信息则不应出现。然而,大量年龄以9结尾的情况相当奇怪。
Digit preference is a likely phenomenon for measurements, but would not be expected for information such as age. Nevertheless, the large number of ages ending in 9 is rather odd.

SA和SI都显示出明显的数字偏好,这很难解释。特别是SA的总剂量,这种偏好尤为奇特,尤其因为它代表了至少六个月内剂量的总和。SI是两个百分比的比值,因此大量的零表明记录不够精确。
Both SA and SI show dramatic digit preference which is hard to explain. That for the total dose of SA is peculiar, especially as it represents the sum of doses given over at least six months. The SI is the ratio of two percentages, so the surfeit of zeros suggests imprecise recording.
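
The terminal digits of age in 7.3 can be tallied directly from the leaves of the stem-and-leaf diagrams given in answer 3.1 (f). This is a small sketch; the variable names are illustrative.

```python
from collections import Counter

# Leaves (terminal digits of age) by stem, copied from answer 3.1 (f)
without_reactions = {3: "135799", 4: "1234489", 5: "1337899", 6: "11457", 7: "122"}
with_reactions = {2: "99", 3: "89", 4: "1244669999", 5: "01113333347799",
                  6: "12237788", 7: "4"}

counts = Counter()
for leaves in list(without_reactions.values()) + list(with_reactions.values()):
    counts.update(leaves)  # each character in the string is one terminal digit

age_row = [counts[str(digit)] for digit in range(10)]
print(age_row, sum(age_row))
```

The tally shows the excess of ages ending in 9 that the answer describes as rather odd.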

7.4 近一半(27/60)的数值以零结尾。其他末尾数字分布较为均匀。这种现象可能是由于不同观察者报告数据时的精确度不同所致。
7.4 Almost half (27/60) of the values end in zero. The other terminating digits are fairly evenly spread. This effect may be due to different observers reporting data to different precision.

第8章 CHAPTER 8

8.1 (a) 较大的医院,仅仅因为分娩数量更多。
8.1 (a) The larger hospital, simply because there are more births.

(b) 较小医院中男婴比例的日常波动更大,因此在任何一天男婴比例超过60%的可能性更高。
(b) The day to day variation in the proportion of boys will be greater in the smaller hospital, so it is more likely to have more than 60% of babies being boys on any day.

8.2 (a) 均值的标准误(SE)是 $\mathrm{SD}/\sqrt{n}$,即

8.2 (a) The SE of the mean is $\mathrm{SD}/\sqrt{n}$, which is

(b) 均值的95%置信区间(CI)范围是均值减去1.96倍SE到均值加上1.96倍SE,因此CI的宽度是 $2 \times 1.96 \times \mathrm{SE}$。若该宽度等于 ,则需要 。假设标准差仍为 ,则有 ,解得

(b) The 95% CI for the mean is the range from mean − 1.96 SE to mean + 1.96 SE, so the width of the CI is $2 \times 1.96 \times \mathrm{SE}$. For this to be equal to we need . Assuming that we still have , then we would have which gives

8.3 另一种表述问题的方式是:“一组中人数小于40或大于60的概率是多少?”患者分配到某种治疗的比例的抽样分布是二项分布。该分布的标准差是 $\sqrt{p(1-p)/n}$。这里 $p = 0.5$,$n = 100$,所以标准差是 0.05。使用正态近似(对于 $p = 0.5$ 非常合适),所需概率是对应于 $z = 2$ 的双尾面积,根据表B2为0.0455,约为5%。

8.3 Another way of phrasing the question is, 'What is the probability of the number in one group being less than 40 or more than 60?'. The sampling distribution for the proportion of patients allocated a particular treatment is the Binomial distribution. The SD of the distribution is $\sqrt{p(1-p)/n}$. Here $p = 0.5$ and $n = 100$, so the SD is 0.05. Using the Normal approximation, which is excellent for $p = 0.5$, the required probability is the two-tailed area corresponding to $z = 2$, which from Table B2 is 0.0455, or about 5%.
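
The two-tailed area in 8.3 can be reproduced as a sketch. The total of 100 patients is an assumption inferred from the symmetric limits of 40 and 60, and the Normal distribution from the standard library stands in for Table B2.

```python
from math import sqrt
from statistics import NormalDist

n = 100  # ASSUMED total number of patients (limits 40 and 60 are symmetric about 50)
p = 0.5  # probability of allocation to a particular treatment

sd = sqrt(p * (1 - p) / n)  # SD of the sampling distribution of the proportion
z = (0.60 - p) / sd         # deviate for a proportion of 60/100
two_tailed = 2 * (1 - NormalDist().cdf(z))

print(round(sd, 3), round(z, 2), round(two_tailed, 4))
```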

8.4 使用单尾检验时应始终明确说明并给出理由。
8.4 The use of a one- tailed test should always be specified and justified.

这种理由,我认为很少合适,是实验者只对某一特定方向的差异感兴趣。以前分析的结果不足以作为进行单尾检验的充分理由。
The justification, which I believe is rarely appropriate, is that the experimenters are only interested in a difference in a particular direction. Results of previous analyses are not an adequate justification for performing a one- tailed test.

第9章 CHAPTER 9

9.1 (a) 如果变化量近似服从正态分布,那么可以使用第9.4.1节和9.5.1节中描述的 t 分布来获得置信区间,并可使用配对 t 检验(或等价地,对差值进行单样本 t 检验)。如果差值呈现近似对称但非正态分布,则可以使用匹配数据的Wilcoxon检验。

9.1 (a) If the changes have a reasonably Normal distribution, then a CI could be obtained using the t distribution as described in sections 9.4.1 and 9.5.1, and a paired t test could be used (or, equivalently, a one sample t test of the differences). A Wilcoxon test for matched data could be used if the differences had a reasonably symmetric but non-Normal distribution.

变化量的正态概率图相当直线,检验结果为,因此上述任一方法均适用。由于变化量近似正态分布,两种方法结果非常接近。采用参数方法,反措施组仰卧心率平均变化的95%置信区间为1.38至12.38次/分钟。配对检验得)。数据表明反措施可能降低了心率,但正确的做法是将此组与未采用反措施的组进行比较,详见下文。
A Normal plot of the changes is quite straight, and the test gives , , so either of the above methods is appropriate. Because the changes have a nearly Normal distribution they will give very similar results. Using the parametric approach, the 95% confidence interval for the mean change in supine heart rate in the countermeasure group is 1.38 to 12.38 beats/min. The paired t test gives ( ). The data thus appear to show that the countermeasure has reduced heart rate, but the correct approach is to compare this group with a group who did not adopt the countermeasure, as discussed below.

(b) 未采用反措施组的心率变化未显著偏离正态分布(),且两组的标准差非常相似,因此可以使用两样本检验比较两组心率变化的差异。平均变化差为10.56,95%置信区间为1.62至19.50,提供了反措施有效性的部分证据。两样本检验结果为)。
(b) The changes in heart rate in the group not adopting the countermeasure were not significantly non-Normal ( , ), and the SDs in the two groups were very similar, so we can use a two sample t test to compare the changes in heart rate in the two groups. The difference between the mean changes is 10.56 and the 95% CI is 1.62 to 19.50, giving some evidence in support of the effectiveness of the countermeasure. The two sample t test gives ( ).

(c) 将同一人的多次观察当作来自不同个体的数据进行分析是不正确的。这里这种影响可能很小,因为只有两名宇航员被重复计入。论文中未标明重复数据。
(c) It is incorrect to analyse multiple observations on the same individuals as if they were from different people. Here the effect is likely to be minimal as only two astronauts were included twice. The duplicate data were not identified in the paper.

(d) 在临床研究中,允许受试者自行选择治疗方案是极不理想的。以这种方式进行的任何临床试验都不会具备可信度。理想情况下(从研究角度看),宇航员应被随机分配接受饮食对策,但这项研究并非设计为前瞻性研究。宇航员之间可能存在的同质性,例如体能方面,或许会减弱志愿者效应。显然,两组的飞行前心率非常相似,这增强了研究结果的可信度。最终,结果的有效性仍需依赖判断。
(d) In clinical research it is highly undesirable to let subjects choose their own treatments. No clinical trial conducted in this way would have credibility. Ideally (from the research point of view) the astronauts should have been randomized to receive the dietary countermeasure, but this was not set up as a prospective study. The likely homogeneity of the astronauts, for example with respect to fitness, would probably lessen the volunteer effect. Clearly the pre-flight heart rates in the two groups were very similar, which strengthens the findings. In the end the validity of the results is a matter of judgement.

9.2 (a) 配对 检验得到 。然而,该检验假设差值服从近似正态分布,而这里显然不满足这一条件。
9.2 (a) The paired t test gives , . However, the test assumes that the differences have a reasonably Normal distribution, which is clearly not the case here.

(b) 即使对原始数据进行了对数转换,差值的对数值仍呈偏态分布。虽然配对 检验给出 ,但更合适的方法是使用威尔科克森配对符号秩检验,该检验给出
(b) Even after log transformation of the original data the differences between the log values have a skewed distribution. Although the paired t test gives , , it is better to use the Wilcoxon matched pairs signed ranks test, which gives , .

9.3 如图9.2和9.3所示,取对数变换使这些数据更接近正态分布。霍奇金病和非霍奇金病患者对数计数的均值分别为6.487和6.089,差值为0.398(标准误=0.212)。90%置信区间为0.041到0.756。这些值的反对数分别为1.04和2.13,给出了两组计数比率的90%置信区间。比率的最佳估计为 $e^{0.398} = 1.49$。

9.3 As was shown in Figures 9.2 and 9.3, log transformation makes these data much nearer to Normal. The mean log counts for the Hodgkin's and non-Hodgkin's disease patients were 6.487 and 6.089 respectively, giving a difference of 0.398 (SE = 0.212). The 90% CI is 0.041 to 0.756. The antilogs of these values are 1.04 and 2.13, which give a 90% CI for the ratio of the counts in the two groups. The best estimate of the ratio is $e^{0.398} = 1.49$.
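
The back-transformation in 9.3 can be sketched as follows, assuming natural logarithms were used (the base is not stated in this answer). Antilogging a difference between mean logs gives a ratio, and the same applies to the confidence limits.

```python
from math import exp

diff_log = 6.487 - 6.089  # difference between the mean log counts
ci_log = (0.041, 0.756)   # 90% CI on the log scale, as quoted in the answer

ratio = exp(diff_log)  # estimated ratio of counts (a ratio of geometric means)
ci_ratio = tuple(exp(limit) for limit in ci_log)

print(round(ratio, 2), tuple(round(limit, 2) for limit in ci_ratio))
```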

9.4 线性类比量表数据不满足基于分布的参数方法的分布假设。可以使用非参数的Mann-Whitney检验比较两组,结果为。因此,有强有力的证据表明接受主动治疗的患者恶心症状较轻。
9.4 The linear analogue scale data do not meet the distributional assumptions for parametric methods based on the t distribution. The groups can be compared using the non-parametric Mann-Whitney test, which gives , . There is thus strong evidence that nausea was less severe in patients receiving the active treatment.

中位数差异的95%置信区间可用Campbell和Gardner(1989)的方法获得,为15到49毫米。总体而言,对于此类数据,估计值和置信区间的价值有限,因为测量值没有直接的解释意义。
A 95% confidence interval for the difference in median scores can be obtained using the method described by Campbell and Gardner (1989), and is 15 to 49 mm. In general estimates and confidence intervals are of limited value for data like these as the measurements have no straightforward interpretation.

9.5 (a) 标准差可通过将标准误乘以 $\sqrt{n}$ 计算得到,如下:

9.5 (a) The SDs can be calculated by multiplying the SEs by $\sqrt{n}$ to give

| 香烟数 Cigarettes | n | 均值 Mean | 标准差 SD |
|---|---|---|---|
| 1-9 | 25 | 0.31 | 0.40 |
| 10-19 | 57 | 0.42 | 0.75 |
| 20-29 | 99 | 0.87 | 1.89 |
| 30-39 | 38 | 1.03 | 1.54 |
| > 40 | 28 | 1.56 | 3.02 |
| 未说明 Unspecified | 25 | 0.56 | 0.80 |

如同所有情况一样,标准差大于均值(且不可能出现负值),尿液中可替宁排泄值呈明显右偏分布。
As in all cases the SD is greater than the mean (and negative values are impossible), the urinary cotinine excretion values are highly skewed to the right.

(b) 某种趋势分析方法,比如单因素方差分析中的趋势检验。
(b) Some form of analysis of trend, such as within a one way analysis of variance.

(c) 如果数据将通过方差分析进行处理,那么我们要求各组的标准差相似(理论上,各组是来自具有相同标准差的总体的样本)。而标准误则没有此要求,因为它们部分依赖于样本量。如上所示,数据不满足这一要求。如果对数转换能产生相似的标准差且分布近似正态,那么可以在单因素方差分析中检验线性趋势。必须排除“未说明”组。最简单的方法是计算吸烟数量与尿液中可替宁水平之间的秩相关。

(c) If the data are to be analysed by analysis of variance then we require standard deviations to be similar (in theory the groups are samples from populations with the same standard deviation). There is no such requirement for standard errors, which are partly dependent on sample size. As shown above, the data do not meet this requirement. If log transformation would yield similar SDs and reasonably Normal distributions, then a linear trend could be applied in a one way analysis of variance. The 'unspecified' group would have to be excluded. The simplest approach would be to calculate the rank correlation between number of cigarettes smoked and urinary cotinine level.

(d)该分析无效的原因有三:
(d) There are three reasons why this analysis is not valid:

(i) 数据在每个组内高度偏斜;
(i) the data are highly skewed within each group;

(ii) 标准差变化巨大;
(ii) the SDs vary enormously;

(iii) 使用多重成对组比较来评估吸烟与尿液可替宁水平之间的关系是一种较差的方法,因为它没有考虑组的顺序性。
(iii) the use of multiple comparisons of pairs of groups is an inferior method of assessing whether there is a relation between smoking and urinary cotinine levels as it takes no account of the ordering of the groups.

9.6 组1和组2数据的成对Wilcoxon检验的P值分别为0.01和0.09,相差不大。更具启示性的是变化的均值和标准差:
9.6 The P values associated with paired Wilcoxon tests of the data for Groups 1 and 2 are 0.01 and 0.09, which are not so far apart. More revealing are the means and standard deviations of the changes:

| | 均值 Mean | 标准差 SD |
|---|---|---|
| 组1 Group 1 | -0.078 | 0.073 |
| 组2 Group 2 | -0.071 | 0.129 |

两组的平均变化几乎相同。
The mean changes in the two groups are almost the same.

比较组的正确方法是直接检验两组变化的差异。两样本检验在22个自由度下得到)。标准差差异较大,因此检验的假设可能不合理。然而,Mann-Whitney检验得到类似结果()。因此,没有证据支持组间存在差异。
The correct way to compare the groups is by testing directly the difference between the changes in the two groups. A two sample t test gives on 22 degrees of freedom . The SDs are rather different, so that the assumptions of the t test may not be considered reasonable. A similar result is, however, obtained from a Mann-Whitney test . There is thus no evidence to support the idea that the groups differ.

9.7 (a) 由于数据不适合检验,治疗后评分应使用Mann-Whitney检验比较。该分析得到),强烈支持 gestrinone 比安慰剂更有效改善患者评分。
9.7 (a) The post-treatment scores should be compared by the Mann-Whitney test because the data are not suitable for the t test. This analysis gives , strong evidence that gestrinone is more effective than placebo in improving the scores of these patients.

(b) 同样的方法可用于评分变化的比较。该分析得到),结果略弱。或者,可用符号检验比较组间正负变化。
(b) The same method can be applied to the changes in scores. This analysis gives , a result which is only slightly weaker. Alternatively the sign test could be used to compare the groups with respect to positive or negative changes.

9.8 可使用Mann-Whitney检验。由于该检验基于秩次,记录为 的截尾值将具有相同秩次。由于该值处有大量并列,对并列秩次进行调整是可取的。此外,SI值显然呈偏斜分布,表明非参数方法是合适的。另一种替代但不太理想的方法是使用下一章中描述的方法,比较超过某一给定阈值的比例。

9.8 The Mann-Whitney test could be used. As it is a test based on ranks, the censored values recorded as would all have the same rank. As there are so many ties at this value the adjustment for ties would be desirable. Also, the SI values clearly have a skewed distribution, indicating that a non-parametric method would be suitable. An alternative, but less satisfactory, approach would be to compare the proportions above a given cut-off using methods described in the next chapter.

第10章 CHAPTER 10

10.1 (a) 必须使用适合配对数据的分析方法。对巴豆油和DNCB呈阴性的比例分别为 。使用第10.4节中给出的方法计算这两个比例差异的 置信区间为从 。检验两个皮肤试验阴性患者比例相同的假设,通过计算
10.1 (a) It is essential to use an analysis appropriate for paired data. The proportions negative to croton oil and DNCB were and respectively. A CI for the difference between these proportions, using the method given in section 10.4, is from to . The hypothesis test that the proportions of patients negative to the two skin tests are the same is evaluated by calculating

根据表B2,该值对应的 。因此,有强有力的证据表明,这些患者中对巴豆油呈阴性的比例低于对DNCB的比例。
which, from Table B2, corresponds to . There is thus strong evidence that fewer of these patients show negative reactions to croton oil than DNCB.

(b) 对DNCB呈阳性反应的患者比例分别为:第I期为75%,第II期为67%,第III期为41%。趋势卡方检验结果为 ,自由度为1,。因此,有强有力的证据表明DNCB反应性与癌症分期相关。
(b) The proportions of patients with a positive reaction to DNCB are 75%, 67% and 41% for stages I, II and III respectively. The Chi squared test for trend gives on 1 degree of freedom . There is strong evidence therefore that DNCB reactivity is related to stage of cancer.

10.2 (a) 作者似乎得出结论认为 表示不存在效应。然而, 值为0.08,仅略大于0.05。
10.2 (a) The author seems to have concluded that means that there is no effect present. However, the value is 0.08 and so only slightly greater than 0.05.

(b) (a) 中所考虑的检验不合适,因为它忽略了组别是有序的这一事实。研究的目的是明确探讨睾酮水平(通过兄弟姐妹性别比例这一代理指标来考察)与歌唱声音水平之间的关系,因此应使用趋势卡方检验。该检验在1个自由度下得出 ,具有高度显著性 。因此,我们可以推断歌唱声音水平与兄弟姐妹性别比例之间存在关系,前提是该检验是有效的(但见下文)。当然,这并不直接回答关于睾酮水平的问题,因为该样本中未测量睾酮水平。

(b) The test considered in (a) is not appropriate because it ignored the fact that the groups were ordered. The aim of the study was explicitly to study the relation between testosterone level (examined by the proxy measure of the sex ratio of siblings) and level of singing voice, so the Chi squared test for trend should be used. The test gives on 1 degree of freedom, which is highly significant . We can therefore infer that there is a relation between level of singing voice and sex ratio of siblings, assuming that the test is a valid one (but see below). This does not, of course, directly answer the question about testosterone levels as they were not measured on this sample.

(c) 这是在检查数据后选择的,因此 值无效。无论如何,忽略顺序是不合理的—趋势检验更为合适。
(c) It was chosen after inspecting the data, so the value is not valid. In any case, it is not sensible to ignore the ordering - the trend test is far preferable.

(d) 表格部分的 值不能超过整个表格的值。该比较的正确值是自由度为2时
(d) The value of for part of the table cannot exceed the value for the whole table. The correct value for this comparison is on 2 degrees of freedom.

(e) 观察值不是独立的,因为422个兄弟姐妹仅涉及195名歌手。大家庭的影响力会大于小家庭。
(e) The observations are not independent, as the 422 siblings related to only 195 singers. Large families will carry more influence than small ones.

(f) 很难说非独立性的重要性。如果这被视为一项初步研究,若结果具有提示性(确实如此),将引导进行直接分析睾酮水平的研究,那么这可能不是太重要。无论如何,没有简单的统计方法可以绕过这个问题。我们可以只研究有一个兄弟姐妹的歌手;为每位歌手随机选择一个兄弟姐妹;或选择每位歌手的长兄弟姐妹,但这些方法都不够理想。完全有效的统计分析将非常复杂。
(f) It is not easy to say how important the non-independence is. If this is considered to be a preliminary study, leading to a study with direct analysis of testosterone levels if the results look suggestive (as they do), then it is probably not too important. In any case, there is no simple statistical way round the problem. We could study only singers with one sibling; choose one sibling at random for each singer; take each singer's oldest sibling, none of which would be very satisfactory. A completely valid statistical analysis would be highly complex.

10.3 (a) 是的,但并非必要。观察频数为2和20,而在原假设下的期望频数约为10,因此可以使用卡方检验。
10.3 (a) Yes, but it was not necessary. The observed frequencies were 2 and 20, but the expected frequencies (under the null hypothesis) were about 10 so the Chi squared test could have been used.

(b) 置信区间基于正态近似。对于安慰剂组中极小的比例,这种近似不成立,导致了不可能的负下限。更根本的是,单独给出每组的置信区间并无帮助。差异比例的95%置信区间更有用,为16%到39%。
(b) The confidence intervals are based on the Normal approximation. This is not valid for the very small proportion in the placebo group, and has led to an impossible negative lower limit. More fundamentally, it is not helpful to give confidence intervals for each group separately. The 95% confidence interval for the difference in proportions is much more useful; it is 16% to 39%.

10.4 (a) 可使用Mann-Whitney检验或趋势卡方检验。两者都需要为每列赋予分数,这些分数可以合理地均匀分布;最明显的是使用1到8的值。
10.4 (a) The Mann- Whitney test or the Chi squared test for trend could be used. Each would require scores to be given to each column. These could reasonably be equally spaced; most obviously the values 1 to 8 could be used.

(b) 当样本量如此庞大时,Mann-Whitney检验难以应用(大多数计算机程序无法对表格执行该检验,而需要3469名儿童的原始数据)。卡方检验无论样本大小都同样容易应用。八组的整体比较得出自由度7时 ),趋势检验得出自由度1时 )。这些数字表明观察到的变异很可能是偶然的,而非男孩睡眠时间长于女孩或反之的趋势。
(b) The Mann-Whitney test is hard to apply when the samples are so large (most computer programs cannot perform the test on a table, but would require data for the 3469 children). The Chi squared test can be applied equally easily regardless of sample size. The overall comparison of the eight groups gives on 7 degrees of freedom , and the trend test gives on 1 degree of freedom . These figures indicate that the variation seen is likely to be due to chance and not to a tendency for boys to sleep longer than girls or vice versa.

10.5 (a) 对于 表只有一个卡方检验,该检验既可解释为比较各行比例,也可解释为比较各列比例。文中描述的两个检验应给出相同的结果。
10.5 (a) There is only one Chi squared test for a table, which can be interpreted as either a comparison of the proportions in each row or in each column. The two tests described should have given the same answer.

(b) 正确的检验统计量是 )或使用Yates校正时的 )。因此,引用的两个结果均不正确。
(b) The correct test statistic is either or depending on whether Yates' correction is used. Thus both of the quoted results were incorrect.

10.6 从表中可以看出,一些患者有三种习惯中的多种,正如预期的那样。因此,计算完整的 表的 是不正确的。如果我们知道每个患者的习惯,
10.6 It is clear from the table that some patients had more than one of the three habits, as would be expected. It is incorrect, therefore, to calculate for the full table. If we knew each patient's habits,

然后可以进行复杂的回归分析,使用第12章中描述的方法。根据现有数据,我们可以为每种习惯构建 表,并计算自由度为2的检验统计量
then a complex regression analysis could be performed, using methods described in Chapter 12. From the available data we could construct tables for each habit and calculate the test statistic on 2 degrees of freedom.

【10】7 阿司匹林组中发展为高血压的女性比例为0.1176(4/34),安慰剂组为0.3548(11/31)。两组差异为0.24,95%置信区间宽泛,为0.04到0.44。采用Yates校正的卡方检验得出 )。因此,有迹象表明阿司匹林可能降低孕妇高血压的风险,但宽泛的置信区间表明对效应大小存在较大不确定性。
10.7 The proportions of women developing hypertension were 0.1176 (4/34) in the aspirin group and 0.3548 (11/31) in the placebo group. The difference is 0.24 with a wide CI from 0.04 to 0.44. The Chi squared test with Yates' correction gives . There is thus a suggestion that aspirin may reduce the risk of hypertension among pregnant women, but the wide CI points to considerable uncertainty about the magnitude of the effect.
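
The figures in 10.7 can be checked with a short calculation. This is a minimal sketch, assuming the usual Normal-approximation CI for a difference between two independent proportions and the standard Yates-corrected Chi squared formula for a 2 × 2 table; the function and variable names are illustrative and are not taken from the book.

```python
import math


def compare_two_proportions(events_a, n_a, events_b, n_b):
    """Return (difference b - a, 95% CI, Yates-corrected chi-squared)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_b - p_a
    # Normal-approximation standard error of the difference
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    # Cells of the 2 x 2 table: events and non-events in each group
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    n = n_a + n_b
    chi2_yates = n * (abs(a * d - b * c) - n / 2) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    return diff, ci, chi2_yates


# Aspirin group: 4 of 34; placebo group: 11 of 31
diff, ci, chi2_yates = compare_two_proportions(4, 34, 11, 31)
print(f"difference = {diff:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
print(f"Yates-corrected chi-squared = {chi2_yates:.2f}")
```

The interval reproduces the 0.04 to 0.44 quoted above.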

【10】8 根据所给信息,可以构建如下 表:
10.8 From the information given a 2 × 2 table can be constructed as follows:

| | Cases + | Cases − | Total |
|---|---|---|---|
| Controls + | 38 | 8 | 46 |
| Controls − | 20 | 20 | 40 |
| Total | 58 | 28 | 86 |

其中 分别表示是否暴露于工作中的噪声。
where + and − refer to presence or absence of exposure to loud noise at work.

(a) 报告暴露于噪声的病例和对照比例分别为0.674(58/86)和0.535(46/86)。比例差为0.14,95%置信区间为0.02到0.26。比例差可用McNemar检验比较,结果为
(a) The proportions of cases and controls reporting exposure to loud noise were 0.674 (58/86) and 0.535 (46/86). The difference in proportions is 0.14, with the CI of 0.02 to 0.26. The proportions can be compared using McNemar's test, which gives

(b) 估计的优势比为 。95%置信区间(第10章未给出方法)为1.05到6.56。
(b) The odds ratio is estimated as the ratio of the discordant pairs, 20/8 = 2.5. The 95% CI (method not given in Chapter 10) is 1.05 to 6.56.
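
The matched-pairs quantities in 10.8 depend only on the four cells of the table. The sketch below recomputes them under two assumptions that the text does not confirm: the standard error for a difference between paired proportions is taken as sqrt(b + c − (b − c)²/n)/n, and McNemar's statistic is computed with a continuity correction. All names are illustrative.

```python
import math


def matched_pairs_summary(both, case_only, control_only, neither):
    """Summaries for a matched case-control 2 x 2 table of pairs."""
    n = both + case_only + control_only + neither
    p_cases = (both + case_only) / n
    p_controls = (both + control_only) / n
    diff = p_cases - p_controls
    # Standard error of a difference between paired proportions
    se = math.sqrt(case_only + control_only
                   - (case_only - control_only) ** 2 / n) / n
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    # McNemar's statistic (continuity-corrected form, an assumption here)
    mcnemar = (abs(case_only - control_only) - 1) ** 2 / (case_only + control_only)
    # Matched odds ratio: ratio of the two kinds of discordant pair
    odds_ratio = case_only / control_only
    return p_cases, p_controls, diff, ci, mcnemar, odds_ratio


# Pairs: both exposed 38, case only 20, control only 8, neither 20
p_cases, p_controls, diff_paired, ci_paired, mcnemar, odds_ratio = (
    matched_pairs_summary(38, 20, 8, 20)
)
print(f"cases {p_cases:.3f}, controls {p_controls:.3f}")
print(f"difference {diff_paired:.2f}, 95% CI {ci_paired[0]:.2f} to {ci_paired[1]:.2f}")
print(f"McNemar chi-squared {mcnemar:.2f}, odds ratio {odds_ratio:.2f}")
```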

第11章 CHAPTER 11

【11】1 (a) 是的。删失的生存时间也是最长的,因此获得最高排名。该方法通常不能用于生存数据,因为删失数据导致生存时间的顺序无法确定。
11.1 (a) Yes. The censored survival time was also the longest and thus gets the highest rank. The method cannot generally be used for survival data because the censored data mean that the order of survival times cannot be determined.

(b) 可以计算皮尔逊相关系数,但由于极长的生存时间(即使忽略删失)会严重夸大结果。生存时间的分布高度偏斜,因此秩相关更为合适。
(b) The Pearson correlation coefficient could be calculated but it would be severely inflated by the very long survival time (even ignoring the censoring). The distribution of survival times is highly skewed, so rank correlation is far preferable here.

(c) 乳酸、碳酸氢盐和pH的变化与生存时间的Spearman等级相关系数分别为0.63、,因此变化与碳酸氢盐的关系最强。
(c) The changes in lactate, bicarbonate and pH have Spearman rank

注意碳酸氢盐和与生存时间呈负相关,而乳酸的相关性为正相关。
correlation coefficients with survival time of 0.63, and respectively, so the strongest relation is with changes in bicarbonate. Note that bicarbonate and have negative associations with survival while for lactate the correlation is positive.

11.2 (a) 线性回归方程为
11.2 (a) The linear regression equation is

残差标准差为157.91千卡/24小时。
and the residual SD is 157.91 kcal/24hr.

(b) 散点图显示残差与体重之间无明显关系。残差的正态性检验结果为 。分布基本符合正态—最大残差对应的是基础代谢率最高的女性。总体来看,没有理由拒绝该分析的有效性。
(b) A scatter diagram shows no obvious relation between the residuals and weight. The test of Normality of the residuals gives , . The distribution is reasonably Normal - the largest residual relates to the woman with the highest RMR. Overall, there is no reason to reject the validity of the analysis.

(c) 回归线斜率的标准误为0.9776千卡/24小时,因此95%置信区间为
(c) The SE of the slope of the regression line is 0.9776 kcal/24 hr, so the 95% CI is

或者为5.09到9.03千卡/24小时。
or 5.09 to 9.03 kcal/24 hr.

(d) 残差的标准差为157.91 kcal/24小时,因此在体重均值处的最窄预测区间大约是预测值上下各两倍该数值。因此,不可能将静息代谢率(RMR)从体重预测到250 kcal/24小时以内的精度。
(d) The SD of the residuals is 157.91 kcal/24 hr, so the narrowest prediction interval (at the mean value of body weight) is about twice this amount either side of the predicted value. Thus it is not possible to predict RMR from body weight to within 250 kcal/24 hr.

11.3 (a) 以血糖为自变量对 进行回归,得到方程
11.3 (a) The regression of on blood glucose gives the equation

残差标准差为0.2167。
The residual SD is 0.2167.

(b) 对该回归线残差的正态性检验给出 。经过对数转换后,残差比原始数据分析的残差更接近正态分布。
(b) The test of Normality of the residuals from this regression line gives , . The residuals after log transformation are thus more nearly Normal than those from the analysis of the raw data.

(c) 使用原始Vcf值的回归方程(表11.6),预测的Vcf为 。95%预测区间为 ,即0.97到 。使用对数转换Vcf的回归方程,预测的Vcf为 。95%预测区间为 ,即1.01到 。这两个方程对血糖为16 mmol/l的个体给出相似的预测结果。
(c) From the regression equation using the raw values of Vcf (Table 11.6) the predicted Vcf is . The 95% prediction interval is or 0.97 to . From the above regression equation using log transformed Vcf, the predicted Vcf is . The 95% prediction interval is or 1.01 to . The two equations thus give similar answers for someone with a blood glucose of 16 mmol/l.

11.4 最后两个受试者的数值完全相同,提示可能存在抄录错误或同一患者被重复纳入。
11.4 The values for the last two subjects are identical, suggesting a transcription error or the inadvertent inclusion of the same patient twice.

11.5 对数肌酐清除率(CC)、对数地高辛清除率(DC)和尿流量之间的相关系数如下:
11.5 The correlations between log creatinine clearance (CC), log digoxin clearance (DC) and urine flow are:

| | r | P |
|---|---|---|
| DC and CC | 0.838 | < 0.0001 |
| DC and flow | 0.515 | 0.002 |
| CC and flow | 0.322 | 0.06 |

这些数据支持第一个陈述,但似乎不支持第二个。通过同时考虑这三个变量,计算偏相关系数,可以得到更好的答案。调整尿流量后的DC与CC的偏相关系数为0.83 ,调整CC后的DC与尿流量的偏相关系数为0.47 (0.005),与简单相关系数几乎无异。因此数据支持关于地高辛清除率的第一个陈述,而不支持第二个。
These figures support the first statement but do not appear to support the second. A better answer is obtained by considering all three variables at once, by calculating the partial correlation coefficients. The partial correlation coefficient between DC and CC adjusting for urine flow is 0.83 , and that between DC and flow adjusting for CC is 0.47 (0.005), hardly different from the simple correlation coefficients. The data thus support the first but not the second of the statements about digoxin clearance.
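
The partial correlations quoted in 11.5 follow from the three simple correlations. A minimal sketch using the standard first-order partial correlation formula; the function name is illustrative.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after adjusting both for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


r_dc_cc, r_dc_flow, r_cc_flow = 0.838, 0.515, 0.322

# DC and CC, adjusting for urine flow
dc_cc_given_flow = partial_correlation(r_dc_cc, r_dc_flow, r_cc_flow)
# DC and flow, adjusting for CC
dc_flow_given_cc = partial_correlation(r_dc_flow, r_dc_cc, r_cc_flow)

print(f"DC-CC adjusted for flow: {dc_cc_given_flow:.2f}")
print(f"DC-flow adjusted for CC: {dc_flow_given_cc:.2f}")
```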

第12章 CHAPTER 12

12.1 (a) (i) 配对t检验,或等价地,双因素方差分析。
12.1 (a) (i) A paired t test or, equivalently, a two way analysis of variance.

(ii) 同(i)。
(ii) Same as (i).

(b) 感兴趣的数据可重写为
(b) The data of interest can be rewritten as

| Subject | Diet N | Diet O |
|---|---|---|
| 1 | 0.31 | 0.77 |
| 2 | 0.26 | 0.43 |
| 3 | 0.16 | 0.25 |
| 4 | 0.27 | 0.39 |
| 5 | 0.18 | 0.25 |

The mean and SD of the differences between the diets (O − N) are 0.182 and 0.160, so the paired t test gives t = 0.182/(0.160/√5) = 2.54. The appropriate value of the t distribution on 4 degrees of freedom is 2.776, so the difference is not quite statistically significant at the 5% level. The 95% CI for the mean difference is 0.182 ± 2.776 × 0.160/√5, or −0.02 to 0.38. The two way analysis of variance, with factors diet and subject, gives an F statistic for the comparison between the diets that is the square of the t value, as expected.
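
The paired analysis in 12.1 (b) can be reproduced directly from the five pairs of values. This is a minimal sketch; the critical value 2.776 is the one quoted in the text, and the variable names are illustrative.

```python
import math
import statistics

diet_n = [0.31, 0.26, 0.16, 0.27, 0.18]
diet_o = [0.77, 0.43, 0.25, 0.39, 0.25]

# Within-subject differences (O - N)
differences = [o - n for o, n in zip(diet_o, diet_n)]
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)            # sample SD (n - 1 divisor)
se_diff = sd_diff / math.sqrt(len(differences))

t_paired = mean_diff / se_diff                      # paired t on n - 1 = 4 df
t_critical = 2.776                                  # 5% two-sided value on 4 df
ci_low = mean_diff - t_critical * se_diff
ci_high = mean_diff + t_critical * se_diff

print(f"mean difference {mean_diff:.3f}, SD {sd_diff:.3f}")
print(f"t = {t_paired:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

Because t is below 2.776, the interval includes zero, which matches the conclusion that the difference is not quite significant at the 5% level.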

12.2 反向逐步多元回归得到以下模型
12.2 Backwards stepwise multiple regression yields the following model

残差标准差为26.7,而原始FRC的标准差为43.7,
The residual SD is 26.7, compared with the SD of the raw FRC

表明该模型解释了FRC变异性的较大部分。值为
values which was 43.7, indicating that the model explains a good proportion of the variability in FRC. The value of is .

残差的正态性检验给出 ,表明残差近似服从正态分布。
A test of Normality of the residuals gives , , indicating that the residuals have a closely Normal distribution.

【12】3 (a) 受体和供体年龄可以通过两个样本 检验进行比较,log 指数值也可以(原始指数值偏态)。结果如下:
12.3 (a) Recipient and donor ages can be compared by two sample tests, as can the log index values (the raw index values are skewed). The results of these are

| | No GvHD mean | No GvHD SD | GvHD mean | GvHD SD | t | P |
|---|---|---|---|---|---|---|
| Recipient age | 22.4 | 5.18 | 28.4 | 8.10 | 2.73 | 0.01 |
| Donor age | 23.1 | 6.69 | 29.0 | 8.07 | 2.43 | 0.02 |
| Log index | 0.111 | 0.860 | 1.115 | 0.662 | 3.92 | 0.0006 |

白血病类型和供体是否怀孕可通过卡方检验与GvHD相关联:
The type of leukaemia and whether the donor had been pregnant can be related to GvHD by Chi squared tests:

| | Leukaemia: AML | Leukaemia: ALL | Leukaemia: CML | Donor pregnancy: No | Donor pregnancy: Yes |
|---|---|---|---|---|---|
| No GvHD | 6 | 12 | 2 | 18 | 2 |
| GvHD | 5 | 4 | 8 | 9 | 8 |

,自由度2,,自由度1,
on 2 df on 1 df

因此,所有五个变量在是否发生GvHD的患者组间均有显著差异。这提示可能构建一个有用的逻辑回归模型以区分这两组。
Thus all five variables are significantly different between the groups of patients who did and did not develop GvHD. This suggests that it might be possible to find a logistic regression model that discriminates usefully between the groups.

(b) The results given here relate to backward stepwise multiple logistic regression with the following potential explanatory variables: recipient's age, donor's age, donor pregnancy, log index, and two dummy variables indicating whether the patient did (1) or did not (0) have ALL and CML. Using the 5% level of statistical significance to decide whether to retain a variable in the analysis, the following model is obtained:

| Variable | Coefficient | SE | z | P |
|---|---|---|---|---|
| Constant | −2.546 | | | |
| CML | 2.251 | 1.106 | 2.035 | 0.04 |
| Pregnancy | 2.496 | 1.101 | 2.266 | 0.02 |
| Log index | 1.488 | 0.720 | 2.067 | 0.04 |

该模型中的三个变量均仅具有中等显著性。
All three variables in this model are only moderately significant.

对于每位患者,我们可以基于此模型计算移植物抗宿主病(GvHD)的概率,并将其与实际发生情况进行比较。概率由逻辑回归模型获得:
For each patient we can calculate the probability of GvHD on the basis of this model, and relate these to what actually happened. The probability is obtained from the logistic regression model:


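$$\operatorname{logit}(p) = -2.546 + 2.251\,\text{CML} + 2.496\,\text{pregnancy} + 1.488\,\text{log index}$$

or

$$p = \frac{1}{1 + \exp\{-(-2.546 + 2.251\,\text{CML} + 2.496\,\text{pregnancy} + 1.488\,\text{log index})\}}$$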

需要记住的是,使用同一组数据来评估模型(即用于建立模型的数据)会略显乐观。最好使用新的数据来验证模型。
It should be remembered that the assessment of a model using the same data that were used to derive the model will give a slightly optimistic picture. It is best to use new data to test a model.

(c) 对于逻辑回归模型中的二元变量,优势比(odds ratio)由给出,其中是估计的回归系数。这对应于编码为1的组相较于编码为0的组的优势增加。
(c) For a binary variable in a logistic regression model, the odds ratio is given by exp(b), where b is the estimated regression coefficient. This corresponds to the increased odds associated with being in the group coded 1 compared to the group coded 0.

对于CML,;对于妊娠,。90%置信区间(CI)计算公式为。CML的置信区间为1.54到58.6;妊娠的置信区间为1.98到74.2。两个置信区间都非常宽,表明从如此小的样本中无法获得精确估计。
For CML we have exp(2.251) = 9.50 and for pregnancy we have exp(2.496) = 12.13. 90% CIs are obtained as exp(b ± 1.645 SE(b)). For CML these values are 1.54 and 58.6; for pregnancy they are 1.98 and 74.2. Both CIs are extremely wide, showing that precise estimates cannot be obtained from a sample this small.
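
The odds ratios and intervals in 12.3 (c) can be recomputed from the coefficients and standard errors in the model table. In this sketch the multiplier 1.645 (a 90% interval) is an assumption inferred from the quoted limits, which it reproduces; the names are illustrative.

```python
import math


def odds_ratio_with_ci(coefficient, standard_error, z=1.645):
    """Odds ratio exp(b) with limits exp(b - z*SE) and exp(b + z*SE)."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return odds_ratio, lower, upper


# Coefficients and standard errors from the logistic regression model
or_cml, cml_low, cml_high = odds_ratio_with_ci(2.251, 1.106)
or_preg, preg_low, preg_high = odds_ratio_with_ci(2.496, 1.101)

print(f"CML: OR {or_cml:.2f}, CI {cml_low:.2f} to {cml_high:.1f}")
print(f"Pregnancy: OR {or_preg:.2f}, CI {preg_low:.2f} to {preg_high:.1f}")
```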

12.4 (a)
12.4 (a)

(b) 。每天吸烟20支相当于每年吸烟 支,所以总计约11万支香烟相当于每天吸烟20支约15年。
(b) . Smoking 20 cigarettes per day is equal to per year, so a total of about 110000 cigarettes is equivalent to smoking 20 per day for about 15 years.

(c) 吸烟的总支数为 ,因此优势比为 。这比非吸烟者有家族史的优势比高出十倍。
(c) The total number of cigarettes smoked is 219000, so the odds ratio is exp . This is ten times the odds ratio for family history in a non-smoker.

12.5 (a) 多元回归模型为
12.5 (a) The multiple regression model is

| Variable | Coefficient | SE | t | P |
|---|---|---|---|---|
| Constant | −6.7459 | 4.3923 | −1.536 | 0.14 |
| Age | −0.0260 | 0.0241 | −1.080 | 0.29 |
| Sex | −0.8029 | 0.5120 | −1.568 | 0.13 |
| Height | 0.0880 | 0.0252 | 3.497 | 0.002 |

在该模型中,只有身高具有统计学显著性。残差标准差为1.185, 。残差标准差是观测值与回归模型预测值差异的标准差。与肺活量原始值的标准差1.657相比,这表示了显著的减少,但仍表明某些情况下预测误差较大。对于 的病例,我们将
Only height is statistically significant in this model. The residual SD is 1.185 and . The residual SD is the SD of the differences between the observed values and those predicted by the regression model. This represents a considerable reduction in comparison with the SD of 1.657 for the raw lung capacity values, but still indicates a large prediction error in some cases. For of cases we would

预计模型的误差不会超过约2.37升(残差标准差的两倍),这超过了平均肺容量的三分之一(平均肺容量为6.05升)。
expect the model to err by more than about 2.37 l (twice the residual SD), which is more than a third of the mean lung volume (which is 6.05 l).

(b) 肺容量对身高的线性回归方程为
(b) The linear regression equation of lung volume on height is

斜率的标准误差为0.0184,残差标准差为1.227,且 。这些数值表明,多元回归模型对数据的拟合仅略优于以身高为自变量的线性回归,正如第一个模型中年龄和性别系数不显著所示。
The SE of the slope is 0.0184 and the residual SD is 1.227, and . These values suggest that the multiple regression model fits the data only marginally better than the linear regression on height, as was indicated by the non- significant coefficients for age and sex in the first model.

(c) 利用上述肺容量对身高的线性回归结果,对于具有平均肺容量的个体, 预测区间为
(c) Using the above results of linear regression of lung volume on height, the prediction interval for someone with average lung capacity is

或者 3.50 到 8.59 升。
or 3.50 to 8.59 l.

(d) 简单的方法是分别对男性和女性进行回归,得到的斜率分别为 0.0736 和 0.0745。
(d) The simple way is to carry out separate regressions for males and females, which give slopes of 0.0736 and 0.0745.

然而,检验斜率是否相同的正确方法是拟合包含交互项的多元回归模型。正如我们从上述相似的斜率所预期的那样,交互作用远未达到统计显著水平。正如我们也看到的,在多元回归模型中性别没有显著影响,我们可以合理地得出结论,肺活量与身高的关系在男性和女性中是相同的。
However, the correct way to test the hypothesis that the slopes are the same is by fitting a multiple regression model including an interaction term. As we would expect from the similar slopes just given, the interaction is nowhere near to being statistically significant. As we have also seen that there was no significant effect of sex in the multiple regression model, we can reasonably conclude that the relation between lung volume and height is the same for males and females.

第13章 CHAPTER 13

13.1 如果我们将实验结束前被删失的三个值视为事件,则得到
13.1 If we take the three values censored before the end of the experiment as events, we get

。差异的证据现在弱得多。
giving , . The evidence for a difference is now much weaker.

13.2 (a) 由于最长的生存时间远大于第二长时间,曲线左侧的有意义的早期部分将被严重压缩。如果排除该患者,图形会更有用。如13.8.2节所述,当仅剩五名患者存活时停止绘制曲线,通常能提供更可靠的视觉印象。
13.2 (a) Because the longest survival time is so much greater than the next longest, the meaningful early part of the curve will be severely compressed at the left hand side. The graph is much more useful if this patient is excluded. As noted in section 13.8.2, stopping the curve when there are only five patients still alive will usually give a more reliable visual impression.

(b) 对数秩检验用于比较不同患者组的生存情况。当感兴趣的变量是连续型时,如本例中,
(b) The logrank test is used to compare survival in different groups of patients. Where the variable of interest is continuous, as here, we can

可以将患者分组为对应广泛数值范围的组,并进行趋势的对数秩检验。
create groups of patients corresponding to broad ranges of values and perform the logrank test for trend.

一种常见的方法是将患者分成三个大小相等的组。组大小为10、10和9,结果如下:
A common approach is to divide the patients into three equal sized groups. Groups of size 10, 10 and 9 give the following results:

| Variable | Logrank X², overall (2 df) | Logrank X², trend (1 df) | P (trend) |
|---|---|---|---|
| Lactate | 10.19 | 9.47 | 0.002 |
| Bicarbonate | 17.26 | 15.01 | 0.0001 |
| pH | 6.10 | 4.62 | 0.03 |

对每个变量,趋势均具有统计学显著性,且组间的大部分差异归因于趋势。因此,三个变量的变化均与生存时间相关。
For each variable the trend is statistically significant, and most of the variation between groups is due to the trend. Thus the changes in all of the three variables are related to survival time.

(c) 对三个变量分别构建Cox回归模型,变量既作为连续变量处理,也按前述方法分为三组,结果汇总如下表:
(c) Cox regression models of each of the three variables treated either as continuous or split into three groups as in the previous analysis are summarized in the following table:

| Variable | Continuous: b | Continuous: SE(b) | Continuous: P | Grouped: b | Grouped: SE(b) | Grouped: P |
|---|---|---|---|---|---|---|
| Lactate | −0.071 | 0.019 | 0.001 | −0.061 | 0.022 | 0.01 |
| Bicarbonate | 0.186 | 0.051 | 0.001 | 0.087 | 0.022 | 0.0001 |
| pH | 3.921 | 1.704 | 0.03 | 0.965 | 0.395 | 0.02 |

回归系数不应直接比较两种分析方法。三变量均与生存显著相关,但显著性水平不同。通常,分组分析的结果与趋势的logrank检验非常相似。
The regression coefficients, , should not be directly compared for the two types of analysis. All three variables are significantly associated with survival by either method, but the level of significance differs. In general, the grouped analysis will give a very similar answer to the logrank test for trend.

若将三变量同时纳入Cox模型,只有碳酸氢盐具有统计学显著性。
If the three variables are all entered into a Cox model together only bicarbonate is statistically significant.

【13】3 (a) 符号相反意味着一个变量高值和另一个变量低值均与死亡风险增加相关。正回归系数表示该变量高值与较差生存相关,负系数则相反。因此,模型预测非CML患者及有移植物抗宿主病(GvHD)患者生存较差。
13.3 (a) The opposite signs mean that high values of one variable and low values of the other variable are associated with an increased risk of dying. A positive regression coefficient means that high values of that variable are associated with worse survival, and conversely for a negative coefficient. Thus the model predicts that survival is worse for non- CML patients and those with GvHD.

(b) 需要计算每组患者的预后指数,具体如下:
(b) We need to calculate the prognostic index for each group of patients. These are as follows:

| Group | PI |
|---|---|
| non-GvHD non-CML | 0.000 |
| non-GvHD CML | −2.508 |
| GvHD non-CML | 2.306 |
| GvHD CML | −0.202 (= −2.508 + 2.306) |

相对于非GvHD非CML组,死亡的相对风险简单地表示为 ,因为该组的PI为零。因此,其他组相对于非GvHD非CML患者的死亡风险为
The relative risk of dying relative to the non-GvHD non-CML group is simply exp(PI), as the PI for that group is zero. Thus the risks of dying in the other groups relative to non-GvHD non-CML patients are

| Group | Relative risk |
|---|---|
| non-GvHD CML | 0.08 |
| GvHD non-CML | 10.03 |
| GvHD CML | 0.82 |
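
The prognostic indices and relative risks in 13.3 (b) are sums and exponentials of the two Cox regression coefficients. A minimal sketch, with illustrative names; the coefficients −2.508 (CML) and 2.306 (GvHD) are those given above.

```python
import math

COEFFICIENT_CML = -2.508
COEFFICIENT_GVHD = 2.306


def prognostic_index(has_gvhd, has_cml):
    """Sum of the coefficients for the characteristics that are present."""
    return COEFFICIENT_GVHD * has_gvhd + COEFFICIENT_CML * has_cml


relative_risks = {}
for has_gvhd in (0, 1):
    for has_cml in (0, 1):
        pi = prognostic_index(has_gvhd, has_cml)
        # Risk relative to the reference group (non-GvHD, non-CML; PI = 0)
        relative_risks[(has_gvhd, has_cml)] = math.exp(pi)
        print(f"GvHD={has_gvhd} CML={has_cml}: PI {pi:+.3f}, "
              f"relative risk {math.exp(pi):.2f}")
```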

(c) 95% 置信区间由区间 给出,即 3.16 到 31.9。
(c) The 95% CI is given by the range exp(b − 1.96 SE(b)) to exp(b + 1.96 SE(b)), or 3.16 to 31.9.

(d) 基于如此小样本的Cox模型极其不可靠,正如上述宽置信区间所示。Cox分析的统计效能取决于“事件”的数量,这里指死亡事件,而非受试者人数。
(d) A Cox model based on such a small sample would be extremely unreliable, as is indicated by the wide CI given above. It is the number of 'events', here deaths, that determines the power of a Cox analysis, not the number of subjects.

第14章 CHAPTER 14

【14】1 (a) Wilcoxon检验(或检验)评估两种方法获得的值在平均水平上是否存在差异。同时必须考虑两种方法对个体患者的一致性,这一点假设检验无法完成。
14.1 (a) The Wilcoxon test (or the test) assesses whether the values obtained by the two methods differ on average. It is essential also to consider how well they agree for individual patients, which cannot be done by a hypothesis test.

(b) 简单分析基于计算两种方法差值的均值和标准差,从而得出95%一致性限,并绘制差值与两值平均数的散点图。
(b) A simple analysis is based on calculating the limits of agreement from the mean and SD of the differences between the two methods, and by plotting the differences against the average of the two values.

差值( - 生物素)的均值和标准差分别为 ,因此95%一致性限为 。这一非常宽的范围被Wilcoxon检验的非显著结果所掩盖。图A14.1展示了差值与两值平均数的关系图,未见差值大小随红细胞容量变化的证据。95%一致性限以实线表示。
The mean and SD of the differences ( - biotin) are and respectively, so the limits of agreement are to . This very wide range is completely disguised by the non- significant Wilcoxon test. Figure A14.1 shows the differences plotted against the average of the two values. There is no evidence that the magnitude of the differences varies with red cell volume. The limits of agreement are shown as solid lines.

(c) 这些患者的红细胞容量可能系统性地不同于健康人群。这并不必然意味着在红细胞容量范围不同的其他人群中,两种方法的一致性会同样差或同样好。
(c) The red cell volumes of these patients may be systematically different from those in the healthy population. It does not necessarily follow that the methods would agree equally badly (or equally well) in a different population with a different range of red cell volumes.

(d) 如果某种方法受食用鸡蛋影响,则建议排除食用过鸡蛋的患者。相反,仅仅因为某些患者…
(d) If one method is affected by consumption of eggs then it would be advisable to exclude patients who had eaten eggs. In contrast, it would be completely invalid to omit some patients simply because

他们的数据存在差异。作者的评论令人好奇,因为他们没有说明哪位患者吃过鸡蛋。如果排除两名疑似数值异常的患者,两种方法间差异的均值和标准差变为 ,有了显著改善。修正后的协议限如图 A14.1 中的虚线所示。
their data were discrepant. The authors' comment is curious, as they do not say which of the patients had eaten an egg. If the two patients with suspect values are excluded, the mean and SD of the between method differences become and , a considerable improvement. The revised limits of agreement are shown in Figure A14.1 as dashed lines.

【14】2 (a) 均值绘制在图 A14.2 中。两组的均值显示出非常相似的模式。
14.2 (a) The means are plotted in Figure A14.2. The means for the two groups show a very similar pattern.

(b) 两组的峰值均值和标准差以及曲线下面积的均值和标准差如下:
(b) The mean and SD of the peak values and areas under the curves for the two groups are

| | Rheumatoid arthritis patients: mean | Rheumatoid arthritis patients: SD | Controls: mean | Controls: SD |
|---|---|---|---|---|
| Peak | 39.96 | 9.13 | 46.12 | 10.58 |
| AUC | 120.05 | 33.61 | 154.93 | 48.23 |

两组可以通过两样本 检验比较,分别得到 )和 )。(Mann-Whitney 检验结果非常相似。)因此,有一定证据表明类风湿关节炎患者的曲线下面积较低。
The two groups can be compared by two sample tests, which give ( ) and ( ) respectively. (Mann- Whitney tests give very similar results.) There is thus some evidence that the area under the curve is lower among patients with rheumatoid arthritis.

(c) 图 A14.3 中显示的个体曲线变化较大—许多曲线与平均曲线差异明显。均值是否能很好地代表整体模式,仍需判断。
(c) The individual curves shown in Figure A14.3 show considerable variation - many look very different from the mean curves. It is a matter of judgement whether the means are a good representation of the overall pattern.


Figure A14.3 Individual curves for each subject (horizontal axis: time in hours).

【14】3 (a) 三分之二承认吸毒的使用者将不允许献血。在剩余的三分之一中,预计有 0.24(24%)会通过检测,从而献血。在非吸毒者中,预计有 0.63(63%)会通过检测并献血。
14.3 (a) The two thirds of drug users who admitted the fact would not be allowed to give blood. Among the other third we would expect 0.24 (24%) to pass the test, and thus give blood. Among the non- drug users we would expect 0.63 (63%) to pass the test and give blood.

因此,献血者中吸毒者的比例为
Thus the proportion of blood donors who would be drug users is

(b) Among the third of drug users who lied, we would expect 0.76 (76%) to fail the polygraph test. Among non-drug users we would expect 0.37 (37%) to fail the test too. Thus the expected proportion of drug users among those failing the test is

换句话说,几乎所有被测试拒绝的人(约96%)都是非药物使用者。为了筛查出一个药物使用者,必须错误地拒绝大约27名真正的捐赠者。
In other words, almost all of those rejected by the test (about 96%) would be non- drug users. To pick up one drug user it would be necessary to reject falsely about 27 genuine donors.
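
The reasoning in 14.3 is an application of Bayes' theorem to the people who reach the polygraph test. The sketch below is illustrative only: the prevalence of drug use among potential donors is not reproduced in this answer, so the value 0.05 is a hypothetical input, and the function name and parameters are likewise assumptions.

```python
def polygraph_screening(prevalence, admit_fraction=2 / 3,
                        user_pass=0.24, nonuser_pass=0.63):
    """Proportion of drug users among donors and among those failing the test."""
    # Drug users who deny use are the only users who take the test
    lying_users = prevalence * (1 - admit_fraction)
    nonusers = 1 - prevalence

    users_passing = lying_users * user_pass
    nonusers_passing = nonusers * nonuser_pass
    users_failing = lying_users * (1 - user_pass)
    nonusers_failing = nonusers * (1 - nonuser_pass)

    among_donors = users_passing / (users_passing + nonusers_passing)
    among_failures = users_failing / (users_failing + nonusers_failing)
    return among_donors, among_failures


ASSUMED_PREVALENCE = 0.05  # hypothetical value, not taken from the exercise
among_donors, among_failures = polygraph_screening(ASSUMED_PREVALENCE)
print(f"drug users among donors: {among_donors:.4f}")
print(f"drug users among those failing the test: {among_failures:.4f}")
```

Whatever prevalence is supplied, the structure of the calculation shows why most people rejected by the test are non-users: the non-users greatly outnumber the users who lie.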

14.4 (a) 以呼吸频率小于或等于30、40、50或60次/分钟作为截断值,结果如下:
14.4 (a) Taking the cut- off as respiratory rates less than or equal to 30, 40, 50 or 60 breaths/min gives

| Cut-off | Sensitivity | Specificity |
|---|---|---|
| 30 | 141/142 = 99% | 16/151 = 11% |
| 40 | 137/142 = 96% | 93/151 = 62% |
| 50 | 127/142 = 89% | 139/151 = 92% |
| 60 | 86/142 = 61% | 148/151 = 98% |

最佳截断值为50次/分钟,整体正确评估率为 的婴儿。
The best cut-off is 50 breaths/min, with an overall correct assessment for 266/293 = 91% of infants.

(b) 使用第14.4.5节中给出的阳性预测值(PPV)和阴性预测值(NPV)公式,计算得到所需值为:
(b) Using the formulae for the positive and negative predictive values (PPV and NPV) given in section 14.4.5, the required values are

| Cut-off | PPV | NPV |
|---|---|---|
| 30 | 3.3% | 99.8% |
| 40 | 7% | 99.8% |
| 50 | 26% | 99.6% |
| 60 | 49% | 98.8% |
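
The predictive values in 14.4 (b) combine the sensitivity and specificity at each cut-off with the prevalence of LRI. In this sketch the prevalence of 0.03 is an assumption: it is not stated in this answer, but it is the value that reproduces the tabulated PPV and NPV. The formulae are the standard Bayes' theorem forms, and the names are illustrative.

```python
# (cut-off, LRI infants above cut-off, LRI total,
#  non-LRI infants at or below cut-off, non-LRI total)
COUNTS = [
    (30, 141, 142, 16, 151),
    (40, 137, 142, 93, 151),
    (50, 127, 142, 139, 151),
    (60, 86, 142, 148, 151),
]
ASSUMED_PREVALENCE = 0.03  # assumption inferred from the tabulated values


def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


results = {}
for cutoff, tp, n_lri, tn, n_no_lri in COUNTS:
    ppv, npv = predictive_values(tp / n_lri, tn / n_no_lri, ASSUMED_PREVALENCE)
    results[cutoff] = (ppv, npv)
    print(f"cut-off {cutoff}: PPV {100 * ppv:.1f}%, NPV {100 * npv:.1f}%")
```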

在低患病率情况下,如本例所示,NPV通常无太大帮助。PPV随着截断值的升高而增加,因此最大化正确预测的“最佳”选择是60次/分钟的截断值。然而,如我们所见,使用该截断值会漏诊近一半的下呼吸道感染(LRI)婴儿。50次/分钟的截断值几乎能识别出所有LRI婴儿,但相比60次/分钟的截断值,测试识别出的婴儿中假阳性(无LRI)会更多。需注意的是,研究样本中约有一半患有LRI—他们是住院患者,因此不代表急性呼吸道感染婴儿的一般人群,而后者的LRI患病率要低得多。
With low prevalence, as here, the NPV is usually not helpful. The PPV increases as the cut-off level rises, so the 'best' choice to maximize correct predictions is a cut-off of 60 breaths/min. However, as we have seen, with this cut-off nearly half of the infants with LRI would be missed. A cut-off of 50 breaths/min would mean that nearly all infants with LRI would be identified, but that more of the infants identified by the test would not have LRI (false positives) compared with a cut-off of 60 breaths/min. Note that about half of the study sample had LRI - they were inpatients and thus unrepresentative of the general population of infants with acute respiratory infection where the prevalence of LRI is much lower.

(c) 阳性预测值(PPV)是指呼吸频率超过 的婴儿中患有下呼吸道感染(LRI)的比例,为 。因此,如果所有呼吸频率超过该临界值的婴儿都接受抗生素治疗,则其中有 的婴儿将被“不必要地”治疗。该临界值下的敏感度为 ,意味着 的LRI婴儿会接受抗生素治疗,而 则不会。在较低的临界值 40 次/分钟下,接受抗生素治疗的LRI婴儿比例会更高( ),但接受治疗的婴儿中只有 是LRI患者。显然,临界值的选择必须基于非统计学的考虑。
(c) The PPV is the proportion of those infants with respiratory rate who have LRI, which is . Thus if all infants with a respiratory rate above the cut-off are treated with antibiotics, of them will have been treated 'unnecessarily'. The sensitivity is at this cut-off, which means that of LRI infants would receive antibiotics. Thus would not get antibiotics. At the lower cut-off of 40 breaths/min the proportion with LRI treated with antibiotics would be rather higher but only of those treated would have LRI. It should be clear that the choice of cut-off must be made on non-statistical considerations.

(d) 如果所有超过临界值的婴儿都接受治疗,则 的LRI婴儿和 的上呼吸道感染(URI)婴儿将被治疗。假设LRI患儿比例为 ,则接受治疗的总体比例为 。因此,只有 的婴儿会接受抗生素治疗,这将大幅节省成本。即使使用 40 次/分钟的临界值,接受治疗的比例也为 ,成本也仅为现有政策的一半。
(d) If all infants above the cut-off are treated then of LRI infants and of URI infants would be treated. Taking the proportion with LRI as , the proportion treated would be . Thus only of infants would be treated with antibiotics, representing an enormous saving in cost. Even using a cutoff of 40 breaths/min the proportion treated would be , which would cost half as much as the existing policy.

14.5 (a) 对于每对牙医,每颗牙齿可以通过汇总表中相关行归类为 SS、SC、CS 或 CC。由此得到的 表格可以通过计算 kappa 值进行评估,具体如下:
14.5 (a) For each pair of dentists each tooth can be categorized SS, SC, CS or CC by aggregating the relevant rows of the table. The resulting tables can be assessed by calculating kappa, as follows:

| Dentists | SS | SC | CS | CC | kappa | % agree |
|---|---|---|---|---|---|---|
| 1 v 2 | 3250 | 280 | 123 | 216 | 0.46 | 90% |
| 1 v 3 | 2182 | 1348 | 43 | 296 | 0.18 | 64% |
| 2 v 3 | 2164 | 1209 | 61 | 435 | 0.26 | 67% |
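
Kappa and the percentage agreement for each pair of dentists can be recomputed from the four counts. This is a minimal sketch with illustrative names. One assumption is made about the data: the SS count for dentists 1 and 2 is taken as 3250, the value for which that row sums to the same 3869 teeth as the other two rows and for which the quoted kappa of 0.46 is reproduced.

```python
def kappa_2x2(ss, sc, cs, cc):
    """Cohen's kappa and observed agreement for two raters, two categories.

    The first letter refers to the first-named dentist, the second letter
    to the second (S = sound, C = carious).
    """
    n = ss + sc + cs + cc
    observed = (ss + cc) / n
    first_sound = (ss + sc) / n
    second_sound = (ss + cs) / n
    # Agreement expected by chance from the marginal proportions
    expected = (first_sound * second_sound
                + (1 - first_sound) * (1 - second_sound))
    return (observed - expected) / (1 - expected), observed


PAIRS = {
    "1 v 2": (3250, 280, 123, 216),  # SS assumed to be 3250 (see lead-in)
    "1 v 3": (2182, 1348, 43, 296),
    "2 v 3": (2164, 1209, 61, 435),
}

kappas = {}
for pair, counts in PAIRS.items():
    kappa, agreement = kappa_2x2(*counts)
    kappas[pair] = (kappa, agreement)
    print(f"{pair}: kappa {kappa:.2f}, agreement {100 * agreement:.0f}%")
```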

表中还显示了百分比一致率。因此,牙医1和2之间的一致性最好。牙医3认为龋齿数量(1644颗)远多于另外两位观察者(分别为339和496颗)。(b) kappa值为0.46通常不被认为是一致性特别好,但牙医1和2对检查的牙齿达成了约 的一致。这种差异是因为他们都认为绝大多数牙齿是健康的。在至少有一位牙医(1或2)认为龋齿的619颗牙齿中,他们仅就 颗(35%)达成一致。良好的一致性取决于具体情况,而非kappa值(更不用说P值—由于样本量巨大,上述所有kappa值均具有高度统计学显著性)。
Also shown is the percentage agreement. Thus the best agreement is between dentists 1 and 2. Dentist 3 considered that there were many more carious teeth (1644) than the other two observers (339 and 496).

(b) A kappa value of 0.46 is not usually considered to be especially good agreement, but dentists 1 and 2 agreed about 90% of the teeth examined. This discrepancy is because they both considered the large majority of teeth to be sound. Among the 619 teeth which at least one of dentists 1 and 2 considered carious, they agreed on only 216 (35%). Good agreement depends upon circumstances, not upon the kappa value (and certainly not upon the P value - because of the huge sample size, all the above kappa values are highly statistically significant).

第15章 CHAPTER 15

15.1 治疗组基线变量中可能影响患者预后的差异可能影响试验结果。重要的是不平衡的程度—统计显著性无关紧要。在表3.5中,除了初诊时的疼痛外,其他差异都微不足道。由于疼痛评分也是试验的结局指标,这种不平衡可能很重要。因此,分析疼痛的基线变化而非试验结束时的值可能更合理。(实际上,该研究作者指出基线疼痛评分并无预后价值。)当预后价值未知的变量存在不平衡时,可以根据该变量的值分别检验各治疗组的结局。通过回归分析调整预后变量的不平衡以比较治疗组,详见15.4节。
15.1 Differences in baseline values in the treatment groups for variables which might affect patient prognosis could affect the result of the trial. It is the magnitude of the imbalance that is important - statistical significance is irrelevant. In Table 3.5, the differences are trivial except perhaps for pain at presentation. As pain score was also the outcome measure for the trial this imbalance could be important. It may therefore be reasonable to analyse the change in pain from baseline rather than the value at the end of the study. (In fact, in this study the authors noted that the baseline pain score was not prognostic.) When there is imbalance in a variable for which the prognostic value is unknown, the outcome can be examined in relation to the values of that variable for each treatment group. Imbalance in a prognostic variable is handled by adjusting the comparison of treatment groups using regression analysis, see section 15.4.

15.2 (a) 只有分配到阿普洛洛尔组的患者会因β受体阻滞剂禁忌而退出。因此,两个组除了大小不等外,还不可比,试验结果将不可靠。
15.2 (a) Only those allocated alprenolol would have been withdrawn because of contraindication for the beta- blocker. The two groups, apart from being of unequal size, would thus have been non- comparable, and the trial results would be unsound.

(b) 不可。
(b) No.

(c) 通过直到开始治疗前才随机分配治疗(即随机化)。
(c) By not allocating treatments (randomizing) until immediately before starting treatment.

15.3 (a) 这是一个无对照试验,这并不是评估治疗效果的正确方法。该研究的另一个很差的特点是研究是“开放式”的,因此患者知道自己何时服用了阿司匹林。此外,考虑到测量结果高度变异,研究样本量极小。
15.3 (a) This was an uncontrolled trial, which is not a proper way to evaluate a treatment. Another very poor feature of this study was that the study was 'open', so that the patients knew when they were taking aspirin. Further, the study was extremely small, especially bearing in mind that the measurements were highly variable.

(b) 标准差通过 计算,三组观察值分别为36.1、26.7和19.8。因此,数据呈偏态分布,尤其是在治疗前,因为标准差超过均值的一半。
(b) The standard deviations are obtained as , or 36.1, 26.7 and 19.8 for the three sets of observations. The data are thus skewed, especially before treatment, because the SD is more than half the mean.

(c) 论文未说明数据是如何分析的(如果有分析的话)。合适的分析方法是配对Wilcoxon检验以比较任意两组数据,或者如果差异近似正态分布,则采用配对检验。也可使用双因素方差分析同时比较三组数据。然而,如前所述,这些分析价值有限,因为缺乏对照组。
(c) The paper gives no indication about how the data were analysed (if at all). An appropriate analysis would be a paired Wilcoxon test to compare any two sets of data, or a paired test if the differences were reasonably Normal. Two way analysis of variance could be used to compare all three groups simultaneously. As noted, however, these analyses would be of limited value because there was no comparison group.

(d) 受试者内变化的均值和标准差(或标准误)将非常有价值,变化的置信区间同样重要。
(d) The mean and SD (or SE) of the within-subject changes would be valuable, as would a CI for the changes.

(e) 基于之前的理由,该研究设计不适合回答问题。数据确实显示了一些改善,尽管暗示这些改善在统计上不显著。置信区间将非常宽。完全错误地认为基于这项小规模研究可以排除阿司匹林可能有效的可能性。
(e) For the reasons already given, the design of this study was inappropriate to answer the question. The data do in fact show some improvement, although it is implied that this was not statistically significant. Confidence intervals would be very wide. It is totally wrong to suggest that on the basis of this small study the possibility of an effect of aspirin can be excluded.

15.4 (a) 两组的配对检验结果分别为 (8自由度;)和 (6自由度;)。仅一组出现高度显著变化可能表明两组之间确实存在差异。然而,这些是组内分析,而临床试验的核心在于直接比较组间差异。我们应使用两样本检验比较治疗一周后的数值或收缩压变化,而非通过独立分析的值间接比较。作者的解释不成立。
15.4 (a) Paired tests for the two groups give (on 8 df; ) and (on 6 df; ). The highly significant changes in one group only may suggest that there is indeed a difference between the groups. However, these are within group analyses whereas the whole point of a clinical trial is to compare the groups directly. We should do this using a two sample test on the values after one week of treatment or on the changes in systolic blood pressure, not via an indirect comparison of values from independent analyses. The authors' interpretation is not valid.

(b) 两组变化的均值(标准差)分别为12.67(8.99)和4.71(7.91),检验结果为 (14自由度;)。同样,一周后血压比较为 (14自由度;)。因此,有一定弱证据表明两组可能存在差异—作者的结论未被正确分析支持。
(b) The means (SD) of the changes for the two groups are 12.67 (8.99) and 4.71 (7.91) respectively, and the test gives (14 df; ). (Likewise, a comparison of the one week blood pressures gives (14 df; ).) There is thus some weak evidence that the groups may differ - the conclusion drawn by the authors is not supported by a correct analysis.
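
The contrast in 15.4 between within-group and between-group analyses can be illustrated from the summary statistics of the changes. In this sketch the group sizes of 9 and 7 are inferred from the 8 and 6 degrees of freedom quoted for the paired tests, and a pooled-variance two sample t test is assumed; names are illustrative.

```python
import math


def paired_t(mean_change, sd_change, n):
    """Within-group paired t statistic on n - 1 degrees of freedom."""
    return mean_change / (sd_change / math.sqrt(n))


def pooled_two_sample_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Between-group t statistic using the pooled variance."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))
    return (mean_1 - mean_2) / se, df


# Changes in systolic blood pressure: mean (SD), with assumed n of 9 and 7
t_group_1 = paired_t(12.67, 8.99, 9)
t_group_2 = paired_t(4.71, 7.91, 7)
t_between, df_between = pooled_two_sample_t(12.67, 8.99, 9, 4.71, 7.91, 7)

print(f"within group 1: t = {t_group_1:.2f} on 8 df")
print(f"within group 2: t = {t_group_2:.2f} on 6 df")
print(f"between groups: t = {t_between:.2f} on {df_between} df")
```

The two within-group statistics look very different, yet the direct comparison of the groups gives a much more modest t value, which is the point made in the answer.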

15.5 (a) 如15.3节所述,此处的标准化差异为
15.5 (a) As described in section 15.3, the standardized difference here is

使用图15.2中的列线图,所需的样本量为600(每组300)。
Using the nomogram in Figure 15.2, the required sample size is 600 (300 per group).

(b) About .

(c) A standardized difference of 0.7 gives power with a sample of size 65. Taking the risk of hypertension in the placebo group as 0.3, we have

where is the proportion developing hypertension in the aspirin group and is the average of 0.3 and . This equation can be solved mathematically or by trial and error; the answer is . The trial was clearly too small to have a good chance of detecting all but a very large benefit of treatment.

(d) The Chi squared test with Yates' correction gives ( ). The P value quoted is one-sided, although there is no comment to that effect in the paper. The result is thus only marginally significant. The RR of 0.33 has a very wide 95% CI from 0.12 to 0.93. The authors are right to interpret their findings cautiously, and to suggest that a larger trial would be needed to confirm (or not) the benefit of aspirin in this setting.
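For part (c), the "trial and error" solution can be sketched as a bisection search. This assumes the usual standardized difference for two proportions, (p1 - p2) / sqrt(p_bar (1 - p_bar)) with p_bar the average of the two proportions; that formula and the function names are assumptions here, since the equation itself is not reproduced in the answer.

```python
import math

def standardized_difference(p1, p2):
    """Assumed form: (p1 - p2) / sqrt(p_bar * (1 - p_bar)), p_bar = mean of p1, p2."""
    p_bar = (p1 + p2) / 2
    return (p1 - p2) / math.sqrt(p_bar * (1 - p_bar))

def solve_treated_proportion(p_control, target, tol=1e-9):
    """Find p2 in (0, p_control) whose standardized difference equals target."""
    lo, hi = 0.0, p_control
    # For p_control below 0.5 the standardized difference falls steadily
    # as p2 rises from 0 towards p_control, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if standardized_difference(p_control, mid) > target:
            lo = mid   # difference still too large: p2 must be bigger
        else:
            hi = mid
    return (lo + hi) / 2

# Placebo risk 0.3 and standardized difference 0.7, as stated in part (c).
p_aspirin = solve_treated_proportion(0.3, 0.7)
print(f"proportion in aspirin group: {p_aspirin:.3f}")
```

Under these assumptions the search settles on a proportion a little under 0.04, which illustrates how large a treatment benefit the trial would need in order to have a good chance of detecting it.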

CHAPTER 16

16.1 The P value is a measure of the strength of evidence against the null hypothesis. It does not indicate the magnitude of the observed effect. Two clinical trials of different sizes may yield the same treatment effect but different P values. For continuous variables the P value also depends upon the variability (see exercise 9.6).

16.2 If we do not consider 'identical' to refer to sample size, then one reason could be variation in the numbers of patients studied. Two identical studies of the same size would not be expected to yield exactly the same results. The observed effects (and P values) would tend to be closer if the studies were large than if they were small. Dramatically different results for large studies may mean that the studies were not as 'identical' as claimed. For example, there may be differences between patients in different countries or between laboratories.

16.3 (a) Assuming a Normal distribution, the average height of men can be expressed as (179.1 - 171.7)/5.75 standard deviations above the mean height of women. This value is 1.287. From Table B1 the upper tail areas corresponding to standard Normal deviates of 1.25 and 1.30 are 0.1056 and 0.0968, so the required value is 0.10, or 10%.

(b) The probability of a man being taller than the height in question is the upper tail area of the Normal distribution corresponding to $z = 0.65$, which from Table B1 is 0.2578. For women we require the tail area corresponding to $z = 1.95$, which is 0.0256. If 60% of adults are women, the proportion of adults taller than this height who are women is given by

$$\frac{0.6 \times 0.0256}{0.6 \times 0.0256 + 0.4 \times 0.2578} = 0.13$$

or 13%.
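Both parts of answer 16.3 can be checked numerically. The sketch below computes Normal upper tail areas with the standard library's `math.erfc` in place of Table B1, and takes 60% of adults to be women as in the exercise; the function names are illustrative only.

```python
import math

def upper_tail(z):
    """Upper tail area of the standard Normal distribution beyond z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Part (a): men's mean height expressed in SDs above the women's mean.
z_a = (179.1 - 171.7) / 5.75
p_a = upper_tail(z_a)

def share_of_women(p_tall_women, p_tall_men, frac_women):
    """Proportion of tall adults who are women, weighting each sex by its share."""
    women = frac_women * p_tall_women
    men = (1 - frac_women) * p_tall_men
    return women / (women + men)

# Part (b): tail areas quoted from Table B1 in the answer.
share = share_of_women(0.0256, 0.2578, 0.60)

print(f"(a) z = {z_a:.3f}, upper tail area = {p_a:.3f}")
print(f"(b) proportion of tall adults who are women = {share:.3f}")
```

Part (b) is a direct application of conditional probability: each tail area is weighted by the share of adults of that sex, and the women's contribution is divided by the total.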

16.4 The age structure of the population has changed markedly, with the proportion of older people being much higher than it was. The total death rate, which is calculated by multiplying the age-specific rates and the numbers at risk, is unchanged because the age-specific reductions in rates are counterbalanced by the greater numbers of elderly people.

16.5 (a) No, because the figures are percentages. Although it is a method of comparing proportions (or percentages) the Chi squared test must be performed on frequencies.

(b) The percentages are and , so the difference is . The 95% confidence interval is 15% to 27%. Because the sample is large the interval is quite narrow.

(c) The simplest way to assess a trend is to use the Chi squared test for trend. It is reasonable to give equally spaced scores to the three social class groups (such as -1, 0 and 1). Those children whose social class was unclassified must be excluded. The proportions being compared were shown at the beginning of the problem. The Chi squared tests give and .

(d) Although the trend is significant (at the 5% level) in one group but not the other, we should not infer that the relation is present only within the non-fluoridated area. The simplest way to compare the trends is to use the regression approach to estimate the change in % dmft for each change in social class category (section 11.15.2) and compare the slopes (section 11.12.1). Alternatively, all the data could be analysed at once in a complicated logistic regression analysis.

(e) The technical term for such a heterogeneity of effect is 'interaction'.

16.6 (a) A graph of the differences between the indirect and direct measurements against their average shows no tendency for the differences to be related to the level of blood pressure. The mean and SD of differences were and 12.46 respectively, so the 95% limits of agreement are to .

(b) The values -10 and 10 are respectively and 1.18 SDs from the mean. Assuming that the differences come from a Normal population, the probability of a difference larger than 10 in either direction is the probability of being less than -10 plus the probability of being greater than 10, which (from Table B1) is , or about . Alternatively, we can take the observed proportion as an estimate, which is or . (This method is less reliable when the differences are Normal.)

(c) Pearson correlation coefficients are 0.32 with weight, 0.18 with arm circumference and with age. There is thus some evidence that the discrepancy between the direct and indirect measurements increases with body weight. The slope of the regression line is 0.394 , indicating an estimated increase in the difference between the methods of per additional 10 kg body weight.

(d) It could be due to a 'learning effect' with the new direct method of measurement. The rank correlation between the differences and the order of measurement is 0.37 .

(e) The mean and SD of the differences between the measurements from the last 40 women are and , so that the limits of agreement are and 20.01. These limits are not much narrower than those derived using the data from all 50 women.
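The calculations behind answer 16.6 (a) and (b) can be sketched as follows. This is illustrative only: the limits of agreement are taken as mean ± 2 SD (an assumed multiplier), the short list of differences is hypothetical, and the mean of -4.7 used in the probability calculation is an assumption back-calculated from the statement that +10 lies 1.18 SDs above the mean with SD 12.46.

```python
import math
import statistics

def limits_of_agreement(differences, multiplier=2.0):
    """Return (lower, upper) limits: mean difference -/+ multiplier * SD."""
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)
    return mean_diff - multiplier * sd_diff, mean_diff + multiplier * sd_diff

def normal_cdf(z):
    """Cumulative standard Normal distribution."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def prob_abs_diff_exceeds(mean_diff, sd_diff, threshold):
    """P(difference < -threshold) + P(difference > threshold) under Normality."""
    z_low = (-threshold - mean_diff) / sd_diff
    z_high = (threshold - mean_diff) / sd_diff
    return normal_cdf(z_low) + (1 - normal_cdf(z_high))

# Hypothetical differences, only to show the function in use.
example_limits = limits_of_agreement([-2.0, 0.0, 2.0])

# SD from the answer; the mean of -4.7 is an assumption (see lead-in).
p_outside = prob_abs_diff_exceeds(-4.7, 12.46, 10)

print(f"example limits of agreement: {example_limits}")
print(f"P(|difference| > 10) = {p_outside:.2f}")
```

The same `limits_of_agreement` function could be applied to the last 40 differences alone to reproduce the comparison made in part (e).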

16.7 (a) The data for the three days can be examined by a two way analysis of variance. The means and SDs for the three days are:

| | Mean | SD |
|---|---|---|
| Day 1 | 151.27 | 62.37 |
| Day 2 | 194.40 | 78.77 |
| Day 3 | 111.47 | 61.14 |

As the SDs are similar it is likely that the assumptions for a parametric analysis will be met, but the residuals should be checked after fitting the model. The comparison of the three times within the analysis of variance gives on 2 and 28 degrees of freedom . The Shapiro-Francia test of the residuals from this model gives , showing that the residuals have a distribution very close to Normal.

As there is highly significant variation among the days it is reasonable to examine each pair of days using t tests with the SE based on the residual SD from the analysis of variance (which is 59.7175 mmol/l). These tests give:

| Comparison | Difference | t | P | P* |
|---|---|---|---|---|
| Day 1 v Day 2 | 43.13 | 1.98 | 0.06 | 0.17 |
| Day 1 v Day 3 | -39.80 | -1.82 | 0.08 | 0.24 |
| Day 2 v Day 3 | -82.93 | -3.80 | 0.0007 | 0.002 |

where P* is P multiplied by 3 (the Bonferroni adjustment). Days 2 and 3 are highly significantly different, even after the Bonferroni correction. The other differences are not significant, but the mean changes in plasma aldosterone are quite large.

(b) The residual SD should also be used to construct a CI for the difference between any pair of means. The 95% CI for the difference between the means on days 1 and 2 is given by , or to . This wide CI suggests that there may well be a real change in plasma aldosterone associated with a rapid change from low to high altitude, but a larger study would be needed to investigate this possibility.

(c) The correlation coefficient between the mountain sickness score (AMS) and the change in plasma aldosterone is . There is little evidence that the two variables are related.
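The pairwise comparisons and the confidence interval in answer 16.7 (a) and (b) can be reproduced approximately from the quoted figures. The sketch below makes several assumptions that are not stated in the answer: 15 subjects (inferred from the 2 and 28 degrees of freedom), differences taken as later day minus earlier day, a tabulated two-sided 5% point of 2.048 for t on 28 df, and a P value obtained by numerical integration of the t density rather than from tables.

```python
import math

def t_two_sided_p(t, df, steps=2000):
    """Two-sided P value for t on df degrees of freedom (Simpson's rule)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3               # area under the density between 0 and |t|
    return max(0.0, 1 - 2 * area)

means = {"Day 1": 151.27, "Day 2": 194.40, "Day 3": 111.47}
residual_sd = 59.7175
n_subjects = 15                        # assumption: inferred from 2 and 28 df
df = 28
se_diff = residual_sd * math.sqrt(2 / n_subjects)

results = {}
for first, second in [("Day 1", "Day 2"), ("Day 1", "Day 3"), ("Day 2", "Day 3")]:
    diff = means[second] - means[first]
    t = diff / se_diff
    p = t_two_sided_p(t, df)
    p_adj = min(1.0, 3 * p)            # Bonferroni adjustment for 3 comparisons
    results[(first, second)] = (diff, t, p, p_adj)
    print(f"{first} v {second}: diff = {diff:.2f}, t = {t:.2f}, P = {p:.4f}, P* = {p_adj:.3f}")

T_CRIT_28 = 2.048                      # tabulated two-sided 5% point, t on 28 df
diff_12 = results[("Day 1", "Day 2")][0]
ci_12 = (diff_12 - T_CRIT_28 * se_diff, diff_12 + T_CRIT_28 * se_diff)
print(f"95% CI for Day 2 minus Day 1: {ci_12[0]:.2f} to {ci_12[1]:.2f}")
```

Because the means are rounded, the recomputed t values can differ from the tabulated ones in the second decimal place. Under these assumptions the interval for days 1 and 2 just includes zero, which matches the non-significant P value for that comparison.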

16.8 No, this is not a reasonable argument. It is true that the confidence interval will be wide. This is not meaningless, but rather indicates that the study was too small to enable precise conclusions to be drawn.

REFERENCES

Ahlmark, G. and Saetre, H. (1976) Long-term treatment with beta-blockers after myocardial infarction. Eur. J. Clin. Pharmacol., 10, 77-83. [Ex 15.2]

Albert, D. A. (1981) Deciding whether the conclusions of studies are justified: a review. Med. Decision Making, 1, 265-75. [16.1]

Altman, D. G. (1982a) Misuse of statistics is unethical. In Statistics in Practice (eds S. M. Gore and D. G. Altman), London, British Medical Association, 1-2. [15.2.9, 16.1]

Altman, D. G. (1982b) How large a sample? In Statistics in Practice (eds S. M. Gore and D. G. Altman), London, British Medical Association, 6-8. [15.3.2]

Altman, D. G. (1982c) Improving the quality of statistics in medical journals. In Statistics in Practice (eds S. M. Gore and D. G. Altman), London, British Medical Association, 21-4. [16.3.10]

Altman, D. G. (1985) Comparability of randomised groups. Statistician, 34, 125-36. [15.4.1, 15.4.6]

Altman, D. G. and Bland, J. M. (1991) Improving doctors' understanding of statistics. J. Roy. Statist. Soc., A. (in press). [16.3.9]

Altman, D. G. and Coles, E. C. (1980) Assessing birth weight-for-dates on a continuous scale. Ann. Hum. Biol., 7, 35-44. [11.12.2, 14.5.4]

Altman, D. G. and Dore, C. J. (1990) Randomisation and baseline comparisons in clinical trials. Lancet, 335, 149-53. [16.3.7]

Altman, D. G. and Gardner, M. J. (1989) Confidence intervals for regression and correlation. In Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London, British Medical Journal, 34-49. [11.12.1]

Altman, D. G., Gore, S. M., Gardner, M. J. and Pocock, S. J. (1989) Statistical guidelines for contributors to medical journals. In Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London, British Medical Journal, 83-100. [15.6.1, 16.5]

Altman, D. G. and Johnson, A. L. (1990) A survey of reviews of the quality of statistics in the medical literature. III. Findings (in preparation) [16.3.1, 16.3.2]

Altman, D. G. and Royston, J. P. (1988) The hidden effect of time. Stat. Med., 7, 629-37. [7.7.2]

Amess, J. A. L., Burman, J. F., Rees, G. M., et al. (1978) Megaloblastic haemopoiesis in patients receiving nitrous oxide. Lancet, ii, 339-42. [9.8.2]

Andreasson, S., Allebeck, P., Engstrom, A. and Rydberg, U. (1987) Cannabis and schizophrenia. A longitudinal study of Swedish conscripts. Lancet, ii, 1483-6. [5.11.2, 5.14]

Anon (1937) Mathematics and medicine. Lancet, i, 31. [16.3.9]

Anon (1954) Numbering off. Br. Med. J., 1, 1314. [16.2]

Anon (1978) The anomaly that wouldn't go away. Lancet, ii, 978. [11.17]

Apgar, V. (1953) Proposal for new method of evaluation of newborn infants. Anesth. Analg., 32, 260-7. [2.4.4]

Apgar, V., Holaday, D. A., James, L. S., et al. (1958) Evaluation of the newborn infant - second report. J. Am. Med. Ass., 168, 1985-8. [2.4.4]

Armitage, P. and Berry, G. (1987) Statistical Methods in Medical Research, 2nd edn, Oxford: Blackwell. [6.7, 7.6.1, 9.6.5, 9.8.2, 12.3.4, 13.2.2]

Ayesh, R., Mitchell, S. C., Waring, R. H., et al. (1987) Sodium aurothiomalate toxicity and sulphoxidation capacity in rheumatoid arthritic patients. Br. J. Rheumatol., 26, 197-201. [Ex 3.1, Ex 10.5]

Bachs, L., Parés, A., Elena, M., et al. (1989) Comparison of rifampicin with phenobarbitone for treatment of pruritis in biliary cirrhosis. Lancet, i, 574-6. [15.4.10]

Bagot, M., Mary, J.-Y., Heslan, M., et al. (1988) The mixed epidermal cell lymphocyte-reaction is the most predictive factor of acute graft-versus-host disease in bone marrow graft recipients. Br. J. Haematol., 70, 403-9. [Ex 12.3]

Bailar, J. C. (1986) Communicating with a scientific audience. In Medical Uses of Statistics (eds J. C. Bailar and F. Mosteller), Waltham, Mass.: NEJM Books, 325-37. [16.5]

Bailar, J. C., Louis, T. A., Lavori, P. W. and Polansky, M. (1984) A classification for biomedical research reports. N. Engl. J. Med., 311, 1482-7. [5.2.4, 5.14]

Baker, C. J., Kasper, D. L., Edwards, M. S. and Schiffman, G. (1980) Influence of preimmunization antibody levels on the specificity of the immune response to related polysaccharide antigens. N. Engl. J. Med., 303, 173-8. [Ex 9.2]

Barnes, D. M., Lammie, G. A., Millis, R. R., et al. (1988) An immunohistochemical evaluation of c-erbB-2 expression in human breast carcinoma. Br. J. Cancer, 58, 448-52. [13.3.1]

Begg, C. B. (1987) Biases in the assessment of diagnostic tests. Stat. Med., 6, 411-23. [14.4.7]

Begg, C. B. and Berlin, A. A. (1988) Publication bias: a problem in interpreting medical data (with discussion). J. Roy. Statist. Soc. A., 151, 419-63. [15.5.2]

Begg, C. B. and Engstrom, P. F. (1987) Eligibility and extrapolation in cancer clinical trials. J. Clin. Oncol., 5, 962-8. [15.2.8]

Begg, T. B. and Hearns, J. B. (1966) Components in blood viscosity. The relative contributions of haematocrit, plasma fibrinogen and other proteins. Clin. Sci., 31, 87-93. [11.5]

Bhargava, S. K., Ramji, S., Kumar, A., et al. (1985) Mid-arm and chest circumferences at birth as predictors of low birth weight and neonatal mortality in the community. Br. Med. J., 291, 1617-19. [16.3.5]

Blackwell, R. and Chang, A. (1988) Video display terminals and pregnancy. A review. Br. J. Obstet. Gynaecol., 95, 446-53. [Ex 4.4]

Bland, J. M. and Altman, D. G. (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-10. [11.3.4, 14.2, 14.2.2, 14.2.3, 14.2.6]

Bland, J. M. and Altman, D. G. (1988) Misleading statistics: errors in textbooks, software and manuals. Int. J. Epidemiol., 17, 245-7. [6.3, 6.5]

Bland, M. (1987) An Introduction to Medical Statistics, Oxford: University Press. [9.6.4, 13.2.2]

Blomqvist, N. (1986) On the bias caused by regression toward the mean in studying the relation between change and initial value. J. Clin. Periodontol., 13, 34-7. [11.3.5]

Booze, C. F. (1977) Epidemiologic investigation of occupation, age, and exposure in general aviation accidents. Aviat. Space Environ. Med., 48, 1081-91. [3.1, Ex 3.2]

Boyd, N. F., Wolfson, C., Moskowitz, M., et al. (1982) Observer variation in the interpretation of xeromammograms. J. Nat. Cancer Inst., 68, 357-63. [14.3]

Bracken, M. B. (1989) Reporting observational studies. Br. J. Obstet. Gynaecol., 96, 383-8. [16.4]

Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research. Volume 1 - The analysis of case-control studies, Lyon: IARC. [5.10.6, 5.14, 10.11.2]

Breslow, N. E. and Day, N. E. (1987) Statistical Methods in Cancer Research. Volume II - The design and analysis of cohort studies, Oxford: University Press/IARC. [5.10.4, 5.11, 5.14]

Brett, A. S., Phillips, M. and Beary, J. F. (1986) Predictive power of the polygraph: can the 'lie detector' really detect liars? Lancet, i, 544-7. [Ex 14.3]

Brostoff, J., Pack, S. and Merrett, T. (1984) A new multiple specific IgE assay - MAST. Lancet, i, 748-9. [14.3.1, 14.3.5]

Brown, G. W. (1984) Discriminant analysis. Am. J. Dis. Child, 138, 395-400. [12.6]

Bungo, M. W., Charles, J. B. and Johnson, P. C. (1985) Cardiovascular deconditioning during space flight and the use of saline as a countermeasure to orthostatic intolerance. Aviat. Space Environ. Med., 56, 985-90. [Ex 9.1]

Burns, K. C. (1984) Motion sickness incidence: distribution of time to first emesis and comparison of some complex motion conditions. Aviat. Space Environ. Med., 50, 521-7. [13.2.1, 13.3.1]

Buyse, M. (1984) Quality control in multi-centre cancer clinical trials. In Cancer Clinical Trials. Methods and Practice (eds M. E. Buyse, M. J. Staquet and R. J. Sylvester), Oxford: University Press, 102-23. [7.7.8]

Campbell, M. J. and Gardner, M. J. (1989) Calculating confidence intervals for some non-parametric analyses. In Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London: British Medical Journal, 71-9. [9.6.3, Ans 9.4]

Campogrande, M., Todros, T. and Brizzolara, M. (1977) Prediction of birthweight by ultrasound measurements of the fetus. Br. J. Obstet. Gynaecol., 84, 175-8. [11.16]

Carmichael, C. L., Rugg-Gunn, A. J. and Ferrell, R. S. (1989) The relationship between fluoridation, social class and caries experience in 5-year-old children in Newcastle and Northumberland in 1987. Br. Dent. J., 167, 57-61. [Ex 16.5]

Caruana, M. P., Lahiri, A., Cashman, P. M. M., et al. (1988) Effects of chronic congestive heart failure secondary to coronary artery disease on the circadian rhythm of blood pressure and heart rate. Am. J. Cardiol., 62, 755-9. [Ex 7.1]

Cavill, I., Trevett, D., Fisher, J. and Hoy, T. (1988) The measurement of the total volume of red cells in man: a non-radioactive approach using biotin. Br. J. Haematol., 70, 491-3. [Ex 14.1]

Centerwall, B. S., Armstrong, C. W., Funkhouser, L. S. and Elzay, R. P. (1986) Erosion of dental enamel among competitive swimmers at a gas-chlorinated swimming pool. Am. J. Epidemiol., 123, 641-7. [10.7]

Chalmers, T. C., Smith, H., Blackburn, B., et al. (1981) A method for assessing the quality of a randomized control trial. Controlled Clin. Trials., 2, 31-49. [15.6.1, 16.4]

Cherian, T., John, T. J., Simoes, E., et al. (1988) Evaluation of simple clinical signs for the diagnosis of acute lower respiratory tract infection. Lancet, ii, 125-8. [Ex 14.4]

Christensen, E. (1987) Multivariate survival analysis using Cox's regression model. Hepatology, 7, 1346-58. [13.6.3]

Christensen, E., Neuberger, J., Crowe, J., et al. (1985) Beneficial effect of azathioprine and prediction of prognosis in primary biliary cirrhosis: final results of an international trial. Gastroenterology, 89, 1084-91. [4.5, 4.6, 7.5.3, 7.7.2, 13.6.1]

Clayton, D. and Hills, M. (1987) A two-period crossover trial. In: The Statistical Consultant in Action (eds D. J. Hand and B. S. Everitt), Cambridge: University Press, 42-57. [15.4.10]

Cleveland, W. S. (1984) Graphs in scientific publications. Am. Stat., 38, 261-9. [16.3.5]

Colditz, G. A. and Emerson, J. D. (1985) The statistical content of published medical research: some implications for biomedical education. Med. Educ., 19, 248-55. [16.2]

Collins, R., Gray, R., Godwin, J. and Peto, R. (1987) Avoidance of large biases and large random errors in the assessment of moderate treatment effects: the need for systematic overviews. Stat. Med., 6, 245-50. [15.4.1, 15.4.9, 15.5.2]

Colton, T. (1974) Statistics in Medicine, Boston: Little, Brown. [1.3, 15.6.2, 16.4]

Cooper, G. S. and Zangwill, L. (1989) An analysis of the quality of research reports in the Journal of General Internal Medicine. J. Gen. Intern. Med., 4, 232-6. [14. ]

Cox, D. R. (1972) Regression models and life tables. J. Roy. Statist. Soc. B., 34, 187-220. [13.6]

Cox, D. R. (1982) Statistical significance tests. Br. J. Clin. Pharmacol., 14, 325-31. [8.8.1]

Cox, D. R. (1983) Discussion of paper by P. Armitage. J. Roy. Statist. Soc. A, 146, 332-3. [16. ]

Cuckle, H. S., Wald, N. J. and Lindenbaum, R. H. (1986) Cord serum alpha-fetoprotein and Down's syndrome. Br. J. Obstet. Gynaecol., 93, 408-10. [5.10.1]

Cuzick, J. (1985) A Wilcoxon-type test for trend. Stat. Med., 4, 87-90. [9.8.7]

Dallal, G. E. (1988) Statistical microcomputing - like it is. Am. Stat., 42, 212-16. [6.3, 6.4, 6.5]

Dalton, M. E., Bromham, D. R., Ambrose, C. L., et al. (1987) Nasal absorption of progesterone in women. Br. J. Obstet. Gynaecol., 94, 84-8. [14.6.1, 14.6.3]

Daly, B. M. and Shuster, S. (1986) Effect of aspirin on pruritis. Br. Med. J., 293, 907. [Ex 15.3]

Davis, P. J. M. (1985) The oral contraceptive pill and the menopause. Update, 15 April, 799-802. [Ex 5.2]

Department of Clinical Epidemiology and Biostatistics, McMaster University (1983) Interpretation of diagnostic data. Can. Med. Assoc. J., 129, 429-32, 559-64, 586, 705-10, 832-5, 947-54, 1093-9. [14.4.8]

De Pauw, M. and Buyse, M. (1984) Design of forms for cancer clinical trials. In: Cancer Clinical Trials. Methods and Practice (eds M. E. Buyse, M. J. Staquet and R. J. Sylvester). Oxford: University Press, 64-82. [6.7]

DerSimonian, R., Charette, L. J., McPeek, B. and Mosteller, F. (1982) Reporting on methods in clinical trials. N. Engl. J. Med., 306, 1332-7. [16.3.7]

Dittrich, H., Gilpin, E., Nicod, P., et al. (1988) Acute myocardial infarction in women: influence of gender on mortality and prognostic variables. Am. J. Cardiol., 62, 1-7. [8.4.2]

Doll, R. and Hill, A. B. (1950) Smoking and carcinoma of the lung. Preliminary report. Br. Med. J., ii, 739-48. [4.1]

Drum, D. E. and Christacapoulos, J. S. (1972) Hepatic scintigraphy in clinical decision making. J. Nucl. Med., 13, 908-15. [14.4]

Dunn, H. L. (1929) Application of statistical methods in physiology. Physiol. Rev., 9, 275-398. [16.2, 16.3.1]

Elashoff, J. D. (1983) Surviving proportional hazards. Hepatology, 3, 1031-5. [13.6.3]

Ellenberg, J. H. and Nelson, K. B. (1980) Sample selection and the natural history of disease. Studies of febrile seizures. J. Am. Med. Ass., 243, 1337-40. [5.11.1]

Ellenberg, S. S. (1984) Randomization designs in comparative clinical trials. N. Engl. J. Med., 310, 1404-8. [15.2.5]

Elwood, P. C. (1982) Randomised controlled trials: sampling. Br. J. Clin. Pharmacol., 13, 631-6. [15.2.8, 15.5.2]

Emerson, J. D. and Colditz, G. A. (1983) Use of statistical analysis in the New England Journal of Medicine. N. Engl. J. Med., 309, 709-13. [16.2]

Epidemiology Work Group of the Interagency Regulatory Liaison Group (1981) Guidelines for the documentation of epidemiological studies. Am. J. Epidemiol., 114, 609-13. [16.4]

Espeland, M. A. and Handelman, S. L. (1989) Using latent class models to characterize and assess relative error in discrete measurements. Biometrics, 45, 587-99. [Ex 14.5]

Evans, S. J. W., Mills, P. and Dawson, J. (1988) The end of the p value? Br. Heart J., 60, 177-80. [16.5]

Feingold, K. R., Browner, W. S. and Siperstein, W. D. (1989) Prospective studies of muscle capillary basement membrane width in prediabetics. J. Clin. Endocrinol. Metab., 69, 784-9. [Ex 8.2]

Feinstein, A. R. (1985) Tempest in a P-pot? Hypertension, 7, 313-18. [11.8]

Feinstein, A. R. (1988) Scientific standards in epidemiologic studies of the menace of daily life. Science, 242, 1257-63. [1.1, 5.14]

Felson, D. T., Cupples, L. A. and Meenan, R. F. (1984) Misuse of statistical methods in Arthritis and Rheumatism. 1982 versus 1967-68. Arthritis Rheumatism, 27, 1018-22. [16.2]

Fentiman, L. S., Rubens, R. D. and Hayward, J. L. (1983) Control of pleural effusions in patients with breast cancer. A randomized trial. Cancer, 52, 737-9. [15.2.3]

Fentress, D. W., Masek, B. J., Mehegan, J. E. and Benson, H. (1986) Biofeedback and relaxation-response in the treatment of pediatric migraine. Dev. Med. Child Neurol., 28, 139-46. [9.8.6]

Festing, M. F. W. (1981) The 'defined' animal and the reduction of animal use. In: New Perspectives in Animal Experimentation (ed. D. Sperlinger). Chichester: Wiley, 285-306. [5.7.4]

Fisher, R. A. and Yates, F. (1963) Statistical Tables for Biological, Agricultural and Medical Research, 6th edn. Edinburgh: Oliver and Boyd. [App. B]

Fleiss, J. L. (1981) Statistical Methods for Rates and Proportions, 2nd edn. New York: Wiley. [10.5, 10.8.2, 10.10, 10.11.2, 10.11.3, 14.3.4]

Fleming, D. M. and Crombie, D. L. (1987) Prevalence of asthma and hay fever in England and Wales. Br. Med. J., 294, 279-83. [8.3]

Fletcher, R. H. and Fletcher, S. W. (1979) Clinical research in general medical journals. A 30-year perspective. N. Engl. J. Med., 301, 180-3. [16.2, 16.3.2]

Frame, S., Moore, J., Peters, A. and Hall, D. (1985) Maternal height and shoe size as predictors of pelvic disproportion: an assessment. Br. J. Obstet. Gynaecol., 92, 1239-45. [10.8.2]

Freedman, L. S. (1979) The use of a Kolmogorov-Smirnov type statistic in testing hypotheses about seasonal variation. J. Epidemiol. Comm. Health, 33, 223-8. [14.7]

Freiman, J. A., Chalmers, T. C., Smith, H. and Kuebler, R. R. (1978) The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 'negative' trials. N. Engl. J. Med., 299, 690-4. [8.5.4, 16.3.2]

Furst, D. E. and Paulus, H. E. (1975) Lack of effect of rheumatoid arthritis on clonixin metabolism. Clin. Pharmacol. Ther., 17, 622-6. [Ex 14.2]

Galen, R. S. and Gambino, S. R. (1975) Beyond Normality: the predictive value and efficiency of medical diagnosis, New York: Wiley. [14.4.8]

Gardner, M. J. and Altman, D. G. (1989a) Estimating with confidence. In: Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London: British Medical Journal, 3-5. [8.8, 16.3.6]

Gardner, M. J. and Altman, D. G. (1989b) Estimation rather than hypothesis testing: confidence intervals rather than P values. In: Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London: British Medical Journal, 6-19. [8.8.1]

Gardner, M. J. and Altman, D. G. (eds) (1989c) Statistics with Confidence, London: British Medical Journal. [App B]

Gardner, M. J., Machin, D. and Campbell, M. J. (1989) Use of check lists in assessing the statistical content of medical studies. In: Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London: British Medical Journal, 101-8. [15.6.1, 16.4]

Gart, J. J., Krewski, D., Lee, P. N., et al. (1986) Statistical Methods in Cancer Research. Volume III - The design and analysis of long-term animal experiments, Lyon: IARC, 29-32. [5.7.4]

Gehlbach, S. H. (1982) Interpreting the Medical Literature. A clinician's guide, Lexington Mass.: D. C. Heath and Co. [5.14]

George, S. L. (1985) Statistics in medical journals: a survey of current policies and proposals for editors. Med. Pediatr. Oncol., 13, 109-12. [16.3.10]

Gibbons, R. D. and Davis, J. M. (1984) The price of beer and the salaries of priests: analysis and display of longitudinal psychiatric data. Arch. Gen. Psychiatry, 41, 1183-4. [5.13]

Gibson, T., Grahame, R., Harkness, J., et al. (1985) Controlled comparison of short-wave diathermy treatment with osteopathic treatment in non-specific low back pain. Lancet, 1, 1258-61. [3.5.1]

Glasser, G. J. and Winter, R. F. (1961) Critical values of the coefficient of rank correlation for testing the hypothesis of independence. Biometrika, 48, 444-8. [App B]

Gore, S. M. (1982) Assessing methods - transforming the data. In: Statistics in Practice (eds S. M. Gore and D. G. Altman), London: British Medical Association, 67-9. [7.6.1]

Gottzsche, P. C. (1989) Methodology and overt and hidden bias in reports of 196 double-blind trials of non-steroidal antiinflammatory drugs in rheumatoid arthritis. Controlled Clin. Trials, 10, 31-56. [15.4.7]

Gould, B. A., Hornung, R. S., Kieso, H. A., et al. (1985) Is the blood pressure the same in both arms? Clin. Cardiol., 8, 423-6. [5.4, 8.5.1]

Grant, A. (1989) Reporting controlled trials. Br. J. Obstet. Gynaecol., 96, 397-400. [15.6.1, 16.4]

Gray-Donald, K. and Kramer, M. S. (1988) Causality inference in observational vs. experimental studies. An empirical comparison. Am. J. Epidemiol., 127, 885-92. [5.9]

Green, S. B. and Byar, D. P. (1984) Using observational data from registries to compare treatments: the fallacy of omnimetrics. Stat. Med., 3, 361-70. [16.3.2]

Greenwood, M. (1932) What is wrong with the medical curriculum? Lancet, i, 1269-70. [16.3]

Greenwood, M. (1948) The statistician and medical research. Br. Med. J., 2, 467-8. [Preface]

Guyatt, G. H., Townsend, M., Kazim, F. and Newhouse, M. T. (1987) A controlled trial of ambroxol in chronic bronchitis. Chest, 92, 618-20. [15.3.2]

Halkin, H., Sheiner, L. B., Peck, C. C. and Melmon, K. L. (1975) Determinants of the renal clearance of digoxin. Clin. Pharmacol. Ther., 17, 385-94. [Ex 11.4]

Halpern, D. F. and Coren, S. (1988) Do right-handers live longer? Nature, 333, 213. [1.1, 5.14, Ex 5.3]

Hampton, J. R. (1981) Presentation and analysis of the results of clinical trials in cardiovascular disease. Br. Med. J., 282, 1371-3. [15.6.1]

Hayden, G. F. (1983) Biostatistical trends in Pediatrics: implications for the future. Pediatrics, 72, 84-7. [16.2]

Hayes, R. J. (1988) Methods for assessing whether change depends on initial value. Stat. Med., 7, 915-27. [11.3.5]

Hill, A. B. (1963) Medical ethics and controlled trials. Br. Med. J., i, 1043-9. [15. , 15.2.9]

Hill, A. B. (1984) A Short Textbook of Medical Statistics, 11th edn, London: Hodder and Stoughton. [15.1, 16.2]

Hofacker, C. F. (1983) Abuse of statistical packages: the case of the general linear model. Am. J. Physiol., 245, R299-R302. [6]

Hogben, L. (1950) Chance and Choice by Cardpack and Chessboard, Volume 1, New York: Chanticleer Press, unnumbered page. [16.3]

Hommel, E., Parving, H.-H., Mathiesen, E., et al. (1986) Effect of captopril on kidney function in insulin-dependent diabetic patients with nephropathy. Br. Med. J., 293, 467-70. [Ex 15.4]

Hoogstraten, B. (1984) Reporting treatment results in solid tumours. In: Cancer Clinical Trials. Methods and Practice (eds M. E. Buyse, M. J. Staquet and R. J. Sylvester), Oxford: University Press, 139-56. [16.3]

Hughes, R. E. and Jones, E. (1985) Intake of dietary fibre and the age of menarche. Ann. Hum. Biol., 12, 325-32. [11.4]

Hulse, J. A., Jackson, D., Grant, D. B., et al. (1979) Different measurements of thyroid function in hypothyroid infants diagnosed by screening. Acta Paediatr. Scand. Suppl., 277, 21-5. [9.6.5]

Ingelfinger, J. A., Mosteller, F., Thibodeau, L. A. and Ware, J. H. (1987) Biostatistics in Clinical Medicine, 2nd edn, New York: Macmillan. [14.4.6, 14.4.8]

Isaacs, D., Altman, D. G., Tidmarsh, C. E., et al. (1983) Serum immunoglobulin concentrations in preschool children measured by laser nephelometry: reference ranges for IgG, IgA, IgM. J. Clin. Pathol., 36, 1193-6. [3.3.1, 14.5.2, 14.5.4]

James, W. H. (1985) Dizygotic twinning, birth weight and latitude. Ann. Hum. Biol., 12, 441-7. [11.5]

Johnson, A. L. and Altman, D. G. (1990) A survey of reviews of the quality of statistics in the medical literature. I. Methods and bibliography (pre-1987). In preparation. [16.3.1]

Kahan, A., Amor, B., Menkes, C. J., et al. (1987) Nicardipine in the treatment of Raynaud's phenomenon: a randomized double-blind trial. Angiology, 38, 333-7. [15.4.10]

Kahneman, D. and Tversky, A. (1982) Subjective probability: a judgement of representativeness. In: Judgement under Uncertainty: Heuristics and Biases (eds D. Kahneman, P. Slovic and A. Tversky), Cambridge: University Press, 32-47. [Ex 8.1]

Karacan, I., Fernandez-Salas, A., Coggins, W. S., et al. (1976) Sleep electroencephalographic-electrooculographic characteristics of chronic marijuana users: part 1. Ann. NY Acad. Sci., 282, 348-74. [10.4, 10.4.1]

Kendell, R. E., de Roumanie, M. and Ritson, E. B. (1983) Influence of an increase in excise duty on alcohol consumption and its adverse effects. Br. Med. J., 287, 809-11. [Ex 5.1]

Kimpen, J., Callaert, H., Embrechts, P. and Bosmans, E. (1987) Cord blood IgE and month of birth. Arch. Dis. Child., 62, 478-82. [14.7]

Kirwan, J. R., Byron, M. A., Winfield, J., et al. (1979) Circumferential measurements in the assessment of synovitis of the knee. Rheumatol. Rehab., 18, 78-84. [14.2.4]

Kitson, T. (1984) The ultimate mile. New Scientist, 103 (1415), 34. [11.14]

Koehn, H. D. and Mostbeck, A. (1981) Age-dependence of immunoreactive trypsin concentrations in serum. Clin. Chem., 27, 502. [9.8.5]

Kuntze, C. E. E., Ebels, T., Eijgelaar, A. and Homan van der Heide, J. N. (1989) Rates of thromboembolism with three different mechanical heart valve prostheses: randomised study. Lancet, 1, 514-17. [16.3.3, 16.3.7]

Kurjak, A., Latin, V. and Polak, J. (1978) Ultrasonic recognition of two types of growth retardation by measurement of four fetal dimensions. J. Perinat. Med., 6, 102-8. [10.11.1]

Lachenbruch, P. A. (1977) Some misuses of discriminant analysis. Meth. Inform. Med., 16, 255-8. [12., 12.6]

Lam, K. C., Lai, C. L., Ng, R. P., et al. (1981) Deleterious effect of prednisolone in HBsAg-positive chronic active hepatitis. N. Engl. J. Med., 304, 380-6. [Ex 8.4]

Landis, J. R. and Koch, G. G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159-74. [14.3.1]

Langhoff-Roos, J., Lindmark, G., Gustavson, K-H., et al. (1987) Relative effect of parental birth weight on infant birth weight at term. Clin. Genet., 32, 240-8. [12.4]

Leitch, I., Hytten, F. E. and Billewicz, W. Z. (1959) The maternal and neonatal weights of some mammalia. Proc. Zool. Soc. Lond., 133, 11-28. [11.2, 11.2.1]

Lentner, C. (Ed.) (1982) Geigy Scientific Tables, Volume 2, 8th edn, Basel: Ciba-Geigy. [4.9.1, App. B]

Lewith, G. T. and Machin, D. (1981) A randomised trial to evaluate the effect of infra-red stimulation of local trigger points, versus placebo, on the pain caused by cervical osteoarthrosis. Int. J. Acupuncture Electro-Therapeut. Res., 6, 277-84. [10.3, 10.7.1]

Lichtenstein, M. J., Mulrow, C. D. and Elwood, P. C. (1987) Guidelines for reading case-control studies. J. Chron. Dis., 40, 893-903. [5.14, 16.4]

Light, I. M., Avery, A. and Grieve, A. M. (1987) Immersion suit insulation: the effect of dampening on survival estimates. Aviat. Space Environ. Med., 58, 964-9. [12.3.5]

Lind, T., Godfrey, K. A., Otun, H. and Philips, P. R. (1984) Changes in serum uric acid concentrations during normal pregnancy. Br. J. Obstet. Gynaecol., 91, 128-32. [3.5.1, 9.10]

Linnet, K. (1987) Two-stage transformation systems for normalisation of reference distributions evaluated. Clin. Chem., 33, 381-6. [14.5.3]

Lippman, M. E., Cassidy, J., Wesley, M. and Young, R. C. (1984) A randomized attempt to increase the efficacy of cytotoxic chemotherapy in metastatic breast cancer by hormonal synchronization. J. Clin. Oncol., 2, 28-36. [16.3.4]

Lucey, M. R. (1987) The need for confidence intervals in reporting clinical trials. Gut, 28, 916-17. [Ex 16.8]

Lumley, J., McKinnon, L. and Wood, C. (1971) Lack of agreement on normal values for fetal scalp blood. J. Obstet. Gynaecol. Br. Commlth., 78, 13-21. [14.5.3]

Lyster, W. R. (1984) Bass singers have a more masculine sex ratio in their siblings than tenors. IRCS Med. Sci., 12, 234. [Ex 10.2]

Macartney, F. J. (1987) Diagnostic logic. Br. Med. J., 295, 1325-31. [14.4.8]

Macfarlane, A. and Mugford, M. (1984) Birth Counts. Statistics of pregnancy and childbirth, London: HMSO, 49. [3.1]

Macgregor, I. D. M. and Balding, J. W. (1988) Bedtimes and family size in English schoolchildren. Ann. Hum. Biol., 15, 435-41. [Ex 10.4]

Machin, D. and Campbell, M. J. (1987) Statistical Tables for the Design of Clinical Trials, Oxford: Blackwell. [13.7, 15.3.2]

Mackenzie, S. G. and Lippman, A. (1989) An investigation of report bias in a case-control study of pregnancy outcome. Am. J. Epidemiol., 129, 65-75. [5.10.3]

Mainland, D. (1950) Statistics in clinical research: some general principles. Ann. NY Acad. Sci., 52(6), 922-30. [16.5]

Manocha, S., Choudhuri, G. and Tandon, B. N. (1986) A study of dietary intake in pre- and post-menstrual period. Hum. Nut.: Appl. Nut., 40A, 213-16. [9.4, 9.5]

Maron, D. J., Telch, M. J., Killen, J. D., et al. (1986) Correlates of seat-belt use by adolescents: implications for health promotion. Prev. Med., 15, 614-23. [8.8.4]

Martin, T. R. and Bracken, M. B. (1987) The association between low birth weight and caffeine consumption during pregnancy. Am. J. Epidemiol., 126, 813-21. [5.11.2, 10.6.1]

Maskin, C. S., Ocken, S., Chadwick, B. and Le Jemtel, T. H. (1985) Comparative systemic and renal effects of dopamine and angiotensin-converting enzyme inhibition with enalaprilat in patients with heart failure. Circulation, 72, 846-52. [12.3.1]

Matsukura, S., Taminato, T., Kitano, N., et al. (1984) Effects of environmental tobacco smoke on urinary cotinine excretion in nonsmokers. N. Engl. J. Med., 311, 828-32. [Ex 9.5]

Matthews, J. N. S., Altman, D. G., Campbell, M. J. and Royston, J. P. (1990) Analysis of serial measurements in medical research. Br. Med. J., 300, 230-5. [14.6.1]

Mattila, K. J., Nieminen, M. S., Valtonen, V. V., et al. (1989) Association between dental health and acute myocardial infarction. Br. Med. J., 298, 779-82. [5.10.6]

May, G. S., DeMets, D. L., Friedman, L., et al. (1981) The randomized clinical trial: bias in analysis. Circulation, 64, 669-73. [15.4]

Mayes, L. C., Horwitz, R. I. and Feinstein, A. R. (1988) A collection of 56 topics with contradictory results in case-control research. Int. J. Epidemiol., 17, 680-5. [5.10.6]

Mazess, R. B., Peppler, W. W. and Gibbons, M. (1984) Total body composition by dual-photon absorptiometry. Am. J. Clin. Nut., 40, 834-9. [11.2, 11.4, 11.8]

Medical Research Council (MRC) (1948) Streptomycin treatment of pulmonary tuberculosis. Br. Med. J., ii, 769-82. [15.1, 16.2]

Meier, P. (1981) Stratification in the design of a clinical trial. Controlled Clin. Trials, 1, 355-61. [15.2.2]

Miao, L. L. (1977) Gastric freezing: an example of the evaluation of medical therapy by randomized clinical trials. In: Costs, Risks, and Benefits of Surgery (eds J. P. Bunker, B. A. Barnes and F. Mosteller), New York: Oxford University Press, 198-211. [15.2.1]

Milledge, J. S., Beeley, J. M., McArthur, S. and Morice, A. M. (1989) Atrial natriuretic peptide, altitude and acute mountain sickness. Clin. Sci., 77, 509-14. [Ex 16.7]

Morris, J. A. and Gardner, M. J. (1989) Calculating confidence intervals for relative risks, odds ratios, and standardized ratios and rates. In: Statistics with Confidence (eds M. J. Gardner and D. G. Altman), London: British Medical Journal, 50-63. [10.11.2]

Moses, L. E. (1987) Graphical methods in statistical analysis. Ann. Rev. Publ. Health, 8, 309-53. [3.7.3]

Moses, L. E., Emerson, J. D. and Hosseini, H. (1984) Analyzing data from ordered categories. N. Engl. J. Med., 311, 442-8. [10.8.2, 10.13]

Mosteller, F., Gilbert, J. P. and McPeek, B. (1980) Reporting standards and research strategies for controlled trials. Agenda for the editor. Controlled Clin. Trials, 1, 37-58. [16.3.7]

Nanjji, A. A. and French, S. W. (1985) Relationship between pork consumption and cirrhosis. Lancet, i, 681-3. [11.8]

Ng, R. P., Lam, K. C., Lai, C. L. and Wu, P. C. (1981) Prednisolone in HBsAg-positive chronic active hepatitis. N. Engl. J. Med., 305, 283. [Ex 8.4]

Noller, K. L. and Melton, L. J. (1985) Study design in perinatal medicine. Am. J. Perinatol., 2, 250-5. [5. ]

Norton, P. G. and Dunn, E. V. (1985) Snoring as a risk factor for disease: an epidemiological survey. Br. Med. J., 291, 630-2. [10.8.2, 12.5]

Numerical Algorithms Group (1987) The NAG Fortran Library - Mark 12. Oxford: Numerical Algorithms Group. [App B]

Oldham, P. D. (1979) Per cent of predicted as the limit of normal in pulmonary function testing: a statistically valid approach. Thorax, 34, 569. [14. ]

Oldham, P. D. (1985) The fluoridation of the Strathclyde Regional Council's water supply: opinion of Lord Jauncey in causa Mrs Catherine McColl against Strathclyde Regional Council: a review. J. Roy. Statist. Soc. A, 148, 37-44. [1.1]

O'Neill, S., Leahy, F., Pasterkamp, H. and Tal, A. (1983) The effects of chronic hyperinflation, nutritional status, and posture on respiratory muscle strength in cystic fibrosis. Am. Rev. Respir. Dis., 128, 1051-4. [3.2, 12.4]

Otulana, B., Mist, B. A., Scott, J. P., et al. (1989) The effect of recipient lung size on lung physiology after heart-lung transplantation. Transplantation, 48, 625-9. [Ex 12.5]

Owen, O. E., Kavle, E., Owen, R. S., et al. (1986) A reappraisal of caloric requirements in healthy women. Am. J. Clin. Nutr., 44, 1-19. [Ex 11.2]

Oye, R. K. and Shapiro, M. K. (1984) Reporting from chemotherapy trials. Does response make a difference in patient survival? J. Am. Med. Ass., 252, 2722-5. [13.5.3]

Peeters, P. H. M., Verbeek, A. L. M., Hendriks, J. H. C. L., et al. (1987) The predictive value of positive test results in screening for breast cancer by mammography in the Nijmegen programme. Br. J. Cancer, 56, 667-71. [12.5.2]

Peto, R., Gray, R., Collins, R., et al. (1988) Randomised trial of prophylactic daily aspirin in British male doctors. Br. Med. J., 296, 313-16. [15.2.8]

Peto, R., Pike, M. C., Armitage, P., et al. (1976) Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer, 34, 585-612. [13.1, 13.7, 15.1, 15.2.2]

Peto, R., Pike, M. C., Armitage, P., et al. (1977) Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. Br. J. Cancer, 35, 1-39. [13.1, 13.3.1, 13.5, 13.7]

Pickering, G. (1978) Normotension and hypertension: the mysterious viability of the false. Am. J. Med., 65, 561-3. [14.5.5]

Pisani, P., Berrino, F., Macaluso, M., et al. (1986) Carrots, green vegetables and lung cancer: a case-control study. Int. J. Epidemiol., 15, 463-8. [5.10.1]

Pocock, S. J. (1977) Randomised clinical trials. Br. Med. J., i, 1661. [15.2.4]

Pocock, S. J. (1983) Clinical Trials: A Practical Approach. Chichester: Wiley. [5.14, 6.7, 15.1, 15.2.5, 15.2.11, 15.4.9]

Pocock, S. J. (1985) Current issues in the design and interpretation of clinical trials. Br. Med. J., 290, 39-42. [15.1]

Prentice, A. M., Black, A. E., Coward, W. A., et al. (1986) High levels of energy expenditure in obese women. Br. Med. J., 292, 983-7. [9.6.1]

Preston-Martin, S., Thomas, D. C., Wright, W. E. and Henderson, B. E. (1989) Noise trauma in the aetiology of acoustic neuromas in men in Los Angeles County, 1978-1985. Br. J. Cancer, 59, 783-6. [Ex 10.8]

Raftery, E. B. and Ward, A. P. (1968) The indirect method of recording blood pressure. Cardiovasc. Res., 2, 210-18. [Ex 16.6]

Ramirez, A., Craig, T. K. J., Watson, J. P., et al. (1989) Stress and relapse of breast cancer. Br. Med. J., 298, 291-3. [10.11.2]

Ramsdale, D. R., Faragher, E. B., Bennett, D. H., et al. (1982) Preoperative prediction of significant coronary artery disease in patients with valvular heart disease. Br. Med. J., 284, 223-6. [Ex 12.4]

Rantakallio, P. and Mäkinen, H. (1984) Number of teeth at the age of one year in relation to maternal smoking. Ann. Hum. Biol., 11, 45-52. [3.3.3, 12.4.10]

Reading, V. M. and Weale, R. A. (1986) Eye strain and visual display units. Lancet, i, 905-6. [10.8.1]

Reisch, J. S., Tyson, J. E. and Mize, S. G. (1989) Aid to the evaluation of therapeutic studies. Pediatrics, 84, 815-27. [16.4]

Richmond, R. L., Austin, A. and Webster, I. W. (1988) Predicting abstainers in a smoking cessation programme administered by general practitioners. Int. J. Epidemiol., 17, 530-4. [12.5.2]

Roberts, R. S., Spitzer, W. O., Delmore, T. and Sackett, D. L. (1978) An empirical demonstration of Berkson's bias. J. Chron. Dis., 31, 119-28. [5.10.1]

Rockwood, K., Stolee, P., Robertson, D. and Shillington, E. R. (1989) Response bias in a health status survey of elderly people. Age Ageing, 18, 177-82. [5.12.2]

Rosenthal, F. S., Bakalian, A. E., Lou, C. and Taylor, H. R. (1988) The effect of sunglasses on ocular exposure to ultraviolet radiation. Am. J. Public Health, 78, 72-4. [9.8.7]

Roth, J. A., Eilber, F. R., Nizze, J. A. and Morton, D. L. (1975) Lack of correlation between skin reactivity to dinitrochlorobenzene and croton oil in patients with cancer. N. Engl. J. Med., 293, 388-9. [Ex 10.1]

Royston, J. P. (1983) A simple method for evaluating the Shapiro-Francia test for non-Normality. Statistician, 32, 297-300. [11.6, App B]

Royston, J. P., Flecknell, P. A. and Wootton, R. (1982) New evidence that the intra-uterine growth-retarded piglet is a member of a discrete subpopulation. Biol. Neonate, 42, 100-4. [7.5.3]

Sackett, D. L. (1979) Bias in analytic research. J. Chron. Dis., 32, 51-63. [5.10]

Sackett, D. L. (1986) Rational therapy in the neurosciences: the role of the randomized trial. Stroke, 17, 1323-9. [5.]

Sackett, D. L. and Gent, M. (1979) Controversy in counting and attributing events in clinical trials. N. Engl. J. Med., 301, 1410-12. [15.4.5]

Sacks, H. S., Chalmers, T. C. and Smith, H. (1983) Sensitivity and specificity of clinical trials: randomized v historical controls. Arch. Intern. Med., 143, 753-5. [15.2.4]

Sankaranarayanan, R., Mohideen, M. N., Nair, M. K. and Padmanabhan, T. K. (1989) Aetiology of oral cancer in patients ≤ 30 years of age. Br. J. Cancer, 59, 439-40. [Ex 10.6]

Schiff, E., Peleg, E., Goldenberg, M., et al. (1989) The use of aspirin to prevent pregnancy-induced hypertension and lower the ratio of thromboxane to prostacyclin in relatively high risk pregnancies. N. Engl. J. Med., 321, 351-6. [Ex 10.7, Ex 15.5]

Schlesselman, J. J. (1982) Case-control Studies. Design, conduct, analysis, Oxford: University Press. [5.10.6, 5.14]

Schoenfeld, D. A. and Richter, J. R. (1982) Nomograms for calculating the number of patients needed for a clinical trial with survival as an endpoint. Biometrics, 38, 163-70. [13.7]

Schoolman, H. M., Becktel, J. M., Best, W. R. and Johnson, A. F. (1968) Statistics in medical research: principles versus practices. J. Lab. Clin. Med., 71, 357-67. [1.3, 9.]

Schor, S. and Karten, I. (1966) Statistical evaluation of medical journal manuscripts. J. Am. Med. Ass., 195, 1123-8. [16.3.1]

Seddon, H. J. (1937) Clinical evidence and statistical proof. Lancet, ii, 412. [1.5]

Seely, S. (1985) Relation between pork consumption and cirrhosis. Lancet, i, 925. [11.8]

Shaper, A. G., Pocock, S. J., Phillips, A. N. and Walker, M. (1986) Identifying men at high risk of heart attacks: strategy for use in general practice. Br. Med. J., 293, 474-9. [1.4.1, 12.5.2]

Shapiro, C. M., Beckmann, E., Christiansen, N., et al. (1986) Immunologic status of patients in remission from Hodgkin's disease and disseminated malignancies. Am. J. Med. Sci., 293, 366-70. [7.3, 9.7, 9.7.1]

Sheps, S. B. and Schechter, M. T. (1984) The assessment of diagnostic tests. A survey of current medical research. J. Am. Med. Ass., 252, 2418-22. [14.4.8, 16.3.2, 16.4]

Silman, A. J. (1985) Failure of random zero sphygmomanometer in general practice. Br. Med. J., 290, 1781-2. [7.7.1]

Silverman, W. A. (1985) Human Experimentation: a guided step into the unknown, Oxford: University Press. [15.2.1, 15.2.9]

Simon, R. (1986) Confidence intervals for reporting results of clinical trials. Ann. Intern. Med., 105, 429-35. [13.4.5, 13.4.7]

Simon, R. and Makuch, R. W. (1984) A non-parametric graphical representation of the relationship between survival and the occurrence of an event: application to responder versus non-responder bias. Stat. Med., 3, 35-44. [13.5.3]

Simon, R. and Wittes, R. E. (1985) Methodologic guidelines for reports of clinical trials. Cancer Treat. Rep., 69, 1-3. [15.6.1, 16.4]

Sivell, L. M. and Wenlock, R. W. (1983) The nutritional composition of British bread: London area study. Hum. Nut.: Appl. Nut., 37A, 459-69. [3.7.3]

Smith, D. G., Clemens, J., Crede, W., et al. (1987) Impact of multiple comparisons in randomized clinical trials. Am. J. Med., 83, 545-50. [15.4.7]

Smith, R. (1981) The relation between consumption and damage. Br. Med. J., 283, 895-8. [11.8]

Smithells, R. W., Sheppard, S., Schorah, C. J., et al. (1980) Possible prevention of neural-tube defects by periconceptional vitamin supplementation. Lancet, i, 339-40. [15.2.4, 15.2.9]

Snedecor, G. W. (1950) The statistical part of the scientific method. Ann. N.Y. Acad. Sci., 52, 792-9. [8.]

Solberg, H. E. (1987) Approved recommendation (1986) on the theory of reference values. Part 1. The concept of reference values. J. Clin. Chem. Clin. Biochem., 25, 337-42. [14.5]

Sprent, P. (1989) Applied Nonparametric Statistical Methods, London: Chapman and Hall, 142-55. [11.14]

Stacpoole, P. W., Lorenz, A. C., Thomas, R. G. and Harman, E. M. (1988) Dichloroacetate in the treatment of lactic acidosis. Ann. Intern. Med., 108, 58-63. [Ex 11.1]

Storr, J., Barrell, E., Barry, W., et al. (1987) Effect of a single oral dose of prednisolone in acute childhood asthma. Lancet, i, 879-82. [Ex 10.3]

Stuart, J., Stone, P. C. W., Freyburger, G., et al. (1989) Instrument precision and biological variability determine the number of patients required for rheological studies. Clin. Hemorheol., 9, 181-97. [16.3.2]

Thakur, C. P. and Sharma, D. (1984) Full moon and crime. Br. Med. J., 289, 1789-91. [4.8]

Thomas, E. J. and Cooke, I. D. (1987) Impact of gestrinone on the course of asymptomatic endometriosis. Br. Med. J., 294, 272-4. [Ex 9.7]

Thompson, E. M., Price, A. B., Altman, D. G., et al. (1985) Quantitation in inflammatory bowel disease using computerised interactive image analysis. J. Clin. Pathol., 38, 631-8. [12.6]

Thuesen, L., Christiansen, J. S., Falstie-Jensen, N., et al. (1985) Increased myocardial contractility in short-term Type 1 diabetic patients: an echocardiographic study. Diabetologia, 28, 822-6. [8.6, 11.6, 11.10]

Tibshirani, R. (1982) A plain man's guide to the proportional hazards model. Clin. Invest. Med., 5, 63-8. [13.6.3]

Toulon, P., Jacquot, C., Capron, L., et al. (1987) Antithrombin III and heparin cofactor II in patients with chronic renal failure undergoing regular haemodialysis. Thromb. Haemostas., 57, 263-8. [7.3, Ex 9.6]

Tufte, E. R. (1983) The Visual Display of Quantitative Information, Cheshire, Conn.: Graphics Press. [3.7.3, 16.3.5]

Tukey, J. W. (1977) Exploratory Data Analysis, Reading, Mass.: Addison-Wesley. [3.7.3]

Tyson, J. E., Furzan, J. A., Reisch, J. S. and Mize, S. G. (1983) An evaluation of the quality of therapeutic studies in perinatal medicine. J. Pediatr., 102, 10-13. [16.3.1]

Ueshima, H., Ogihara, T., Baba, S., et al. (1987) The effect of reduced alcohol consumption on blood pressure: a randomised, controlled, single blind study. J. Hum. Hypertension, 1, 113-19. [15.4.1, 15.4.10]

Wainer, H. (1984) How to display data badly. Am. Stat., 38, 137-47. [16.3.5]

Wald, N. and Cuckle, H. (1989) Reporting the assessment of screening and diagnostic tests. Br. J. Obstet. Gynaecol., 96, 389-96. [16.4]

Weindling, A. M., Bamford, F. N. and Whittall, R. A. (1986) Health of juvenile delinquents. Br. Med. J., 292, 447. [10.7.2]

Weiss, S. H., Goedert, J. J., Sarngadharan, M. G., et al. (1985) Screening test for HTLV-III (AIDS agent) antibodies. J. Am. Med. Ass., 253, 221-5. [14.4.4, 14.4.8]

Welle, S. L., Seaton, T. B. and Campbell, R. G. (1986) Some metabolic effects of overeating in man. Am. J. Clin. Nutr., 44, 718-24. [Ex 12.1]

Weller, M. P. I. and Weller, B. (1986) Crime and psychopathology. Br. Med. J., 292, 55-6. [5.13]

Williams, C. J., Davies, C., Raval, M., et al. (1989) Comparison of starting antiemetic treatment 24 hours before or concurrently with cytotoxic chemotherapy. Br. Med. J., 298, 430-1. [Ex 9.4]

Woods, J. R., Williams, J. G. and Tavel, M. (1989) The two-period crossover design in medical research. Ann. Intern. Med., 110, 560-6. [15.2.5]

Yusuf, S., Collins, R., Peto, R., et al. (1985) Intravenous and intracoronary fibrinolytic therapy in acute myocardial infarction: overview of results on mortality, reinfarction and side-effects from 33 randomized controlled trials. Eur. Heart J., 6, 556-85. [15.5.2]

Zelen, M. (1979) A new design for randomized clinical trials. N. Engl. J. Med., 300, 1242-5. [15.2.5]

Zhang, Y., Nitter-Hauge, S., Ihlen, H., et al. (1986) Measurement of aortic regurgitation by Doppler echocardiography. Br. Heart J., 55, 32-8. [14.2.1]

Zweig, J. P. and Csank, J. Z. (1978) The application of a Poisson model to the annual distribution of daily mortality at six Montreal hospitals. J. Epidem. Comm. Hlth., 32, 206-11. [4.8]

Index

The prefix N refers to an entry in Appendix A on mathematical notation

Abnormal pathology 409-13, 417

Absolute value 187, N509

Absorptiometry 279

Acoustic neuroma 275

Actuarial method 371

Adaptive designs 449

Adjusted 345-6, 351

Adrenocorticotrophic hormone 442

Adverse reactions 45

Age 21, 390, 392, 461
  and blood pressure 318
  and body fat 278, 282, 283, 286, 294, 298
  gestational 42, 266, 310, 326, 425
  of menarche 287, 298
  and reference interval 423-6
  and road accidents 24-6
  -sex register 7
  at tooth eruption 31

Agreement
  between categorical assessments 405-8
  between methods of measurement 277, 284, 396-403, 484, 486
  between observers 397, 403-9
  inter-rater 397, 403-9
  proportional 404, 407, 409

AIDS 413

Albumin 52, 54, 56, 57-8, 136, 139, 148, 155, 160, 163-4, 166-7, 288, 390

Alcohol consumption 91, 103, 275, 297, 469

Aldosterone 503

Algorithm 111-12, 120

Allergy 405

Alpha-fetoprotein 94, 419

Alpha error see Type I error

Alprenolol 474

Alternate allocation 446, 485, 494

Alternative hypothesis 165, 168

Altitude 503

Ambroxol 458

Ambulatory blood pressure monitoring 79

Amniocentesis 419

Anaemia 413

Analgesic 83, 430, 436

Analysis of covariance 309, 318, 339, 465

Analysis of data 3, 4, 5, 8, 38, 113-14, 118-19
  choice of method 179-80, 189
  confirmatory 338
  exploratory 113, 121, 174, 282, 298, 338, 359
  relation to design 5, 179-80, 426, 430
  relation to type of data 10, 17, 180
  strategy for 112-14, 175-6

Analysis of variance
  and linear regression 297, 308
  and multiple regression 325-6
  multiple comparisons in 210-13, 215
  one way 205-13, 221, 328, 333, 426
    assumptions 182, 206
    Kruskal-Wallis 213-15, 265, 335
    linear trend 212-13, 215-16, 219, 318-19
    mathematics 218-20
    non-parametric 213-15, 265, 335
    ordered groups 212-13, 215-16, 219, 318-19
    presentation of results 220-2
    using means and SDs 219
    using raw data 219
  residuals from 207, 330, 334
  two way 325-36
    assumptions 330-1, 334
    Friedman's (non-parametric) 334-6
    goodness-of-fit 330

Angina 2, 7

Animal experiments 83, 90

Anova see Analysis of variance

Anti-convulsant therapy 97

Anti-hypertensive treatment 50, 124, 125, 451, 458, 471-2

Anti-smoking advice 357

Antibiotic 438

Antiemetic treatment 225

Antilogarithm 37, 62, 267, 401, 424, N511

Aortic valve disease 397, 400

Apgar score 15, 172, 180
  and growth retardation 267

Arm circumference 80, 397

Area under the curve (AUC) 430, 431-3

Arthritis, rheumatoid 45, 274-5, 436, 465

Arthritis and Rheumatism 479

Ascertainment bias 95

Aspirin
  and hypertension in pregnancy 275, 476
  and myocardial infarction 451
  and pruritus 474
  and stroke 451

Association 276
  and agreement 284, 401
  and causality 96, 102-3, 247, 297-8, 321, 467, 490
  spurious 283, 285
  see also Correlation

Asthma 66, 466
  in children 273-4
  in women 157, 162, 165, 230, 231, 232

Astronauts 223

Average 21-2

Aviation accidents 19, 47

Azathioprine 148, 389-90, 391

Babies, sex of 49, 71-2, 177

Back-transformation 37, 38, 61, 202, 421

Back pain 38-9, 77

Balanced design 80

Bar (e.g. x̄) 22, N509

Bar diagram 19, 24, 39

Bartlett's test 207-8

Baseline characteristics 6, 38-9, 90, 461-2, 464-5, 473

Batches 90

Bayes' theorem 415

Bed, hours spent in 274

Beer 101, 102

Berkson, J. 93

Berkson's bias 93

Beta-blocker 386, 471-2, 474

Beta error see Type II error

Betel chewing 275

Bias
  in analysis 386, 387
  ascertainment 95
  avoidance of 7
  in design 7, 81, 93, 94, 96, 102
  detection of 95
  in method comparison 398, 402
  prevention of 85-6, 441, 494
  publication 169-70, 472-3, 483
  recall 94-5
  sampling 7
  surveillance 99
  volunteer 100, 446, 484
  see also Clinical trial, bias in

Bilirubin 60-2, 136, 143, 148, 157, 164, 390, 392

Binary variable 10, 339, 351-2, 359, 414, 458

Binomial distribution 63-6, 68-70, 155, 157, 186, 230, 231
  Normal approximation to 66, 155, 157, 161, 186, 230, 231, 239, 459
  for paired proportions 239-40
  and sign test 240

Bioavailability 432

Biopsy 359

Biotin 435-6

Birth
  date of 446
  month of 434

Birth sign 466

Birth weight 2, 6, 78, 98, 174, 319
  centiles 425
  and length of gestation 266, 310, 326, 425
  and metabolic rate 322
  and parental birth weight 337
  piglets 142

Blinding 82
  see also Clinical trial, blindness

Blindness (lack of sight) 442

Blood 90

Blood donors 413, 437

Blood group 49, 63-6, 68-9, 71

Blood pressure 2, 124, 125, 132, 146, 147, 173-4, 284, 318, 397, 423, 451, 458, 475
  circadian rhythm of 148, 433, 434
  comparison of two arms 79-82, 83, 84, 86, 90, 168, 333
  measurement of 12, 146, 147
  in pregnancy 426, 501
  variability of 35, 72-3, 78, 82
  see also Hypertension

Blood transfusion 419

Blood viscosity 288, 348-9

Body
  fat percentage 278-9, 282, 283, 286, 294, 298
  mass index (BMI) 125
  mass percentage (BMP) 342-5, 347, 349
  temperature 78
  see also Height; Weight; Obesity

Bone marrow transplantation 361, 395

Bonferroni method (correction) 211, 261, 329, 465

Box-and-whisker plot 33, 39, 62

Brackets N506-7, N510

Bran 504

Bread consumption 43

Breast cancer 66, 89, 444
  adjuvant chemotherapy for 452, 486
  and the contraceptive pill 1
  and positive nodes 89, 375, 382, 462
  recurrence of 269-70
  screening for 356-7
  type of treatment 365, 452, 489

British Medical Journal 465, 473, 488, 494

Caesarean section 102
  and maternal shoe size 229, 261-5, 319

Caffeine 98, 242-4, 247-8, 249, 265

Calculator 17, 35, 112, 175, 256

Cancer 94, 102, 393
  advanced 447
  breast see Breast cancer
  cervical 50, 95
  and fluoride 1, 90
  liver 96
  lung 51, 91, 93
  oral 275
  recurrence 269-70, 393
  registrations 66
  skin 272

Cannabis 103

Canonical variate 359

Captopril 475

Cardiac bypass surgery 207

Cardiovascular disease 7

Caries 438

Carrots 93

Case control study 50, 74, 76, 91, 93-6, 102, 266, 268-70, 494
  ascertainment bias 95
  matched 94, 189, 269
  recall bias 94-5
  selection of cases 94
  selection of controls 93-4, 484

Categorical data 10-11, 117, 123, 229
  analysis of 229-72
  in multiple regression 339, 351

Causal link 91, 96, 102-3, 247, 297-8, 321, 467, 490

Censored observations 16, 22, 365, 369, 370, 378, 385, 394

Census 74

Centile 31-3, 37, 58, 221, 358, 420-1
  confidence interval for 422-3

Central cholestasis 390

Central limit theorem 154, 164, 173, 177, 181

Central range 33, 37, 57

Cervical osteoarthrosis 233, 252

Cervical smear 95

Change
  from baseline 430, 466
  over time 101-2
  relation to initial value 284-5

Checking results 114

Checklists 494-7
  for clinical trials 473, 491, 494, 496-7
  for general medical papers 495

Chemotherapy 225, 447, 452, 486

Chi squared distribution 214, 244-6, 247, 248, 252, 259, 261, 335, 373, 381, 382, 383, N512
  relation to Normal distribution 244-6, 258
  table of 523

Chi squared test 241-65, 271, 467, 491
  degrees of freedom 245, 246-7
  interpretation 247-8
  presentation 271
  sample size 248, 253
  for trend 261-5, 319
  table 249, 250-3, 254, 257-9, 260
    equivalence to comparison of proportions 257-8, 259, 271
    Yates' correction 252-3, 260
  table 259-65
    ordered groups 261-5
    unordered groups 259-61
  table 266
  table 242-4, 247-9, 265

Chlorine 250-2, 269

Cholesterol 2, 59, 165, 166, 285

Cigarettes 13, 351
  see also Smoking

Circadian variation 79, 148, 434

Circumferential shortening 300-3, 306-9, 310-18, 320, 323

Cirrhosis 96, 297-8, 390
  see also Primary biliary cirrhosis

Clinical importance 170, 297, 455-6, 457, 461, 464

Clinical practice 5

Clinical research 8

Clinical trial 76, 102, 167, 440-74
  adaptive designs 449
  adjusting for other variables 375, 464-5
  alternate allocation 446, 485, 494
  analysis of 461-71
  assessment of 473-4, 494
  baseline characteristics 461-2, 464-5, 473
  bias in 441, 442, 445, 446, 450, 461, 464, 469, 473, 483, 494
  blindness 82, 88, 449, 450, 474, 494
  checklist for 473, 491, 494, 496-7
  comparability of groups 461-2, 464-5, 473
  comparison with uncontrolled study 441, 478
  controls 446-7, 480, 483
  crossover see Crossover trial
  design 441-55
  diagnostic (entry) criteria 451-2, 454, 460, 471, 485
  double blind 450, 494
  dropouts see Withdrawals
  eligibility criteria 451-2, 454, 460, 471, 485
  ethical aspects 449-50, 451, 452-3
  exclusions 464, 485
  historical controls 446-7, 453
  incomplete data 463
  informed consent 449, 452, 454
  intention to treat analysis 464
  interpretation of results 442, 471-3, 490
  in medical journals 480, 482, 483
  method of randomization 491, 494
  minimization 443-5
  multi-centre 89, 443, 455, 460
  multiple comparisons 453-4, 465
  non-randomized 443, 446-7, 453, 483
  outcome measure 16, 393, 453-4, 462, 465-6, 473, 487
  parallel groups 447
  phases I-IV 440, 442
  placebo 450-1, 452
  protocol 454-5, 466
    violation 454, 463-4
  pseudo-random allocation 446, 485, 494
  random allocation 85, 91, 442-3, 482, 491
  randomization 442-3, 454, 461, 462, 485, 491
    block 443
    simple 86-7, 443, 444, 452, 464
    stratified 88-9, 443, 444, 450
    weighted 87, 444, 459
  sample selection 451-2, 454, 460, 471, 485
  sample size 443, 447, 452, 454, 455-60, 464, 474, 484
  sequential design 448-9, 455
  side effects 447, 451, 453, 454, 463
  single blind 450
  subgroup analyses 466-7, 472, 473
  systematic allocation 446, 485, 494
  treatment allocation 86, 87, 442-7, 450, 461, 485
  withdrawals 447, 463, 471, 473
  writing up 473

Clonixin 436

Cluster analysis 360

Cochran, W. G. 248

Coffee 102

Cohort life table 371

Cohort study 74, 91, 96-9, 102, 480
  historical 91
  loss to follow up 98-9
  selection of subjects 97-8
  surveillance bias 99

Colitis, ulcerative 359

Combining tables 270-1

Combining data from different studies 383, 472-3

Community controls 94

Comparability of groups 39, 78, 81, 88-9, 375, 442, 461-2, 464-5, 473

Comparative study 6, 485
  see also Clinical trial

Comparison
  of categorical assessments 405-8
  of distributions 31
  of groups (categorical data) 160-1, 229-72
    independent 232-5, 241, 250-8, 259-65, 266-9, 319
    paired 235-41, 258-9, 266, 269-70
  of groups (continuous data) 189-223, 326-36
    independent 191-223, 318-19
    paired 189-91, 222-3
  of methods of measurement 277, 284, 396-403, 484, 486
  of observers 397, 403-9
  of risks 266-71
  of survival 371-6, 379-85, 386-7
  of variances 197-8, 206

Complex analyses 223, 285, 360

Compliance 463-4

Computer
  advantages 107-8
  availability vii, 479, 492
  data input 114-16, 122, 123
  disadvantages 108-10
  graphics 40, 108, 119-20, 125, 142, 149
  misuses 120-1, 479, 488, 492
  package see Computer program
  precision of calculations 17, 112
  simulation 120, 155, 157
  software see Computer program
  strategy for statistical analysis 112-14
  use of 13, 38, 107-21, 149, 175, 205

Computer monitor (VDU) 72, 77, 91, 95, 259-61

计算机程序
Computer program

用于方差分析 206, 208, 212-13, 218
for analysis of variance 206, 208, 212- 13, 218

错误 108-9, 111
errors in 108- 9, 111

评估 vii, 110-12
evaluation of vii, 110- 12

精确P值 168, 171, 253
exact P value from 168, 171, 253

用于费舍尔精确检验 254, 256, 257
for Fisher's exact test 254, 256, 257

用于线性回归 293, 302, 308, 310, 312, 320
for linear regression 293, 302, 308, 310, 312, 320

用于逻辑回归 355
for logistic regression 355

用于曼-惠特尼检验 196, 197, 265
for Mann- Whitney test 196, 197, 265

缺失数据 109-10, 124, 130-1
missing data in 109- 10, 124, 130- 1

用于多元回归 344, 345, 348, 349
for multiple regression 344, 345, 348, 349

用于随机数 86
for random numbers 86

用于秩相关 288, 296
for rank correlation 288, 296

电子表格 112
spreadsheet 112

用于生存分析 366, 370, 375, 377, 379, 389, 391
for survival analysis 366, 370, 375, 377, 379, 389, 391

用于 t 检验 194
for t test 194

用于正态性检验 149, 291, 303
for testing Normality 149, 291, 303

各类型 110
types of 110

结论 4, 7, 477, 482, 483
Conclusions 4, 7, 477, 482, 483

条件概率 368
Conditional probability 368

置信区间 162-165, 223
Confidence interval 162- 5, 223

相关系数的置信区间 279, 282, 288, 293-4, 295, 297
for correlation coefficient 279, 282, 288, 293- 4, 295, 297

几何平均数的置信区间 202
for geometric mean 202

错误使用 486, 487
incorrect use 486, 487

卡帕系数的置信区间 405, 406
for kappa 405, 406

一致性限的置信区间 402
for limits of agreement 402

均值及其差异的置信区间 162-4, 181, 183-4, 190, 192-3, 201-2, 209-10, 221, 222, 329
for means and their differences 162- 4, 181, 183- 4, 190, 192- 3, 201- 2, 209- 10, 221, 222, 329

中位数及其差异的置信区间 173, 185, 194
for medians and their differences 173, 185, 194

置信区间表 535-7
table for 535- 7

优势比的置信区间 269-70
for odds ratio 269- 70

论文中的置信区间 177, 486, 487, 489-90, 498
in papers 177, 486, 487, 489- 90, 498

对比例及其差异 165, 230, 233, 235, 236-7, 252, 253, 271, 369, 416
for proportions and their differences 165, 230, 233, 235, 236- 7, 252, 253, 271, 369, 416

对回归系数 306, 313-15, 316, 319, 320, 321, 336, 351, 354
for regression coefficients 306, 313- 15, 316, 319, 320, 321, 336, 351, 354

优于假设检验 166, 169, 175, 473, 485
preferable to hypothesis test 166, 169, 175, 473, 485

表达方式 176
presentation of 176

与假设检验的关系 175, 235, 240
relation to hypothesis test 175, 235, 240

对相对风险 267-8
for relative risk 267- 8

对生存时间分析 369-70, 371, 376, 378-9, 383-5, 391
for survival time analyses 369- 70, 371, 376, 378- 9, 383- 5, 391

变换后 199, 201-2
after transformation 199, 201- 2

置信限见置信区间
Confidence limits see Confidence interval

确证性分析 338
Confirmatory analysis 338

混杂因素 81, 402-3, 484
Confounding 81, 402- 3, 484

列联表 参见 频数表
Contingency table see Frequency table

连续性校正
Continuity correction

卡方检验的连续性校正 252-3, 260
for Chi squared test 252- 3, 260

比例比较的连续性校正 231-2, 235, 238-9
for comparison of proportions 231- 2, 235, 238- 9

McNemar 检验的连续性校正 258-9
for McNemar's test 258- 9

符号检验的连续性校正 187
for sign test 187

连续数据 12-13, 40, 117-18, 123-4, 165, 272, 434, 457-8
Continuous data 12- 13, 40, 117- 18, 123- 4, 165, 272, 434, 457- 8

分析 179-223, 277-321, 325-61
analysis of 179- 223, 277- 321, 325- 61

口服避孕药 1, 50, 95, 105
Contraceptives, oral 1, 50, 95, 105

对照组 77
Control group 77

在病例对照研究中 93-94, 484
in case control study 93- 4, 484

在临床试验中 446-447, 480, 483
in clinical trial 446- 7, 480, 483

受控试验,参见临床试验
Controlled trial see Clinical trial

坐标 N513
Coordinates N513

冠状动脉疾病 363
Coronary artery disease 363

相关性 277-300, 341, 344, 351, 401
Correlation 277- 300, 341, 344, 351, 401

用于评估非正态性 291-292
for assessing non- Normality 291- 2

关联与因果关系 247, 297-8, 321
association and causality 247, 297- 8, 321

假设 279
assumptions 279

系数 278, 299
coefficient 278, 299

与回归的区别 277, 320-1
distinction from regression 277, 320- 1

国际相关 290-1, 298
international 290- 1, 298

解释 297-8, 318
interpretation 297- 8, 318

Kendall等级相关系数 (τ) 286
Kendall's rank (τ) 286

数学基础 293-6
mathematics 293- 6

矩阵 288, 299, 342
matrix 288, 299, 342

误用 282-5, 320-1, 401-2, 409, 489
misuses 282- 5, 320- 1, 401- 2, 409, 489

混合样本 283
mixed samples 283

偏相关 288-291, 296, 297, 348-349
partial 288- 91, 296, 297, 348- 9

Pearson(积差相关系数)(r) 278-286, 288, 293-294, 297, 346
Pearson's (product moment) (r) 278- 86, 288, 293- 4, 297, 346

置信区间 279, 282, 293-294, 297
confidence interval 279, 282, 293- 4, 297

假设检验 279, 282, 294, 320
hypothesis test 279, 282, 294, 320

表格见 528-529
table for 528- 9

展示 299
presentation 299

秩 265, 279, 285-288, 295-296, 297
rank 265, 279, 285- 8, 295- 6, 297

限制样本 283
restricted sample 283

样本量 298
sample size 298

斯皮尔曼等级相关系数 286-8, 295-6, 297
Spearman's rank 286- 8, 295- 6, 297

置信区间 288, 295
confidence interval 288, 295

假设检验 287, 295-6
hypothesis test 287, 295- 6

表格 530
table for 530

虚假关系 283, 285
spurious 283, 285

尼古丁代谢物(Cotinine) 226
Cotinine 226

库尔特计数器(Coulter counter) 90
Coulter counter 90

计数 11, 66, 143, 241 另见 频率
Count 11, 66, 143, 241 see also Frequency

协方差分析 309, 318, 339, 465
Covariance, analysis of 309, 318, 339, 465

协变量 80, 389, 392 另见 预测变量
Covariate 80, 389, 392 see also Predictor variable

Cox 回归 387-93 345
Cox regression 387- 93 345

肌酐 145
Creatinine 145

肌酐清除率 323
Creatinine clearance 323

犯罪 68
Crime 68

克罗恩病 359
Crohn's disease 359

交叉分类 229, 326 另见 频数表
Cross- classification 229, 326 see also Frequency table

横断面研究 76, 92, 99-101, 102, 480
Cross- sectional study 76, 92, 99- 101, 102, 480

因果关系?100-101
cause or effect? 100- 1

响应率 100
response rate 100

样本选择 99-100
sample selection 99- 100

志愿者偏倚 100
volunteer bias 100

交叉列表 229, 326 另见 频数表
Cross- tabulation 229, 326 see also Frequency table

交叉试验
Crossover trial

分析 467-471
analysis 467- 71

基线数据 469-471
baseline data 469- 71

残留效应 448, 469
carry- over effect 448, 469

设计 447-448, 467
design 447- 8, 467

时期效应 448, 467, 469
period effect 448, 467, 469

治疗效应 469
treatment effect 469

治疗-周期交互作用 448, 467, 469
treatment- period interaction 448, 467, 469

洗脱期 448, 469, 471
wash- out period 448, 469, 471

巴豆油 272
Croton oil 272

立方体 N507
Cube N507

三次曲线 425
Cubic curve 425

累积频数 29-31, 133
Cumulative frequency 29- 31, 133

累积相对频数 29-31, 54
Cumulative relative frequency 29- 31, 54

曲线
Curve

三次的 425
cubic 425

二次的 310, 317, 319, 424
quadratic 310, 317, 319, 424

正弦(正弦波)434-5
sine (sinusoidal) 434- 5

周期性变化 433-5
Cyclic variation 433- 5

另见昼夜节律变化;季节性变化
see also Circadian variation; Seasonal variation

囊性纤维化 21, 50, 73, 97, 338, 347, 349
Cystic fibrosis 21, 50, 73, 97, 338, 347, 349

数据
Data

二元变量 10, 339, 351-2, 359, 414, 458
binary 10, 339, 351- 2, 359, 414, 458

分类变量 10-11, 117, 123, 229, 339, 351
categorical 10- 11, 117, 123, 229, 339, 351

分类变量分析 229-72
analysis of 229- 72

截尾数据 16, 22, 365, 369, 370, 378, 385, 394
censored 16, 22, 365, 369, 370, 378, 385, 394

检查 113, 122-6, 149
checking 113, 122- 6, 149

清理 122-6
cleaning 122- 6

收集 3, 114-19, 485
collection 3, 114- 19, 485

连续型 12-13, 40, 117-18, 123-4, 165, 272, 434, 457-8
continuous 12- 13, 40, 117- 18, 123- 4, 165, 272, 434, 457- 8

分析 179-223, 277-321, 325-61
analysis of 179- 223, 277- 321, 325- 61

描述 19-45
description 19- 45

离散型 11, 63, 66
discrete 11, 63, 66

挖掘 120-1, 282
dredging 120- 1, 282

录入 113, 122-3
entry 113, 122- 3

名义型 11
nominal 11

顺序型(有序类别)11, 180, 229, 249, 261-5, 272, 319, 409, 434, 486
ordinal (ordered categorical) 11, 180, 229, 249, 261- 5, 272, 319, 409, 434, 486

精度 487
precision 487

展示 18, 42-5, 221, 271, 433
presentation 18, 42- 5, 221, 271, 433

注册系统 484
registries 484

筛查 113, 122, 132-43, 148, 149
screening 113, 122, 132- 43, 148, 149

转换 见 数据转换
transformation see Transformation of data

类型 10-18, 180
types of 10- 18, 180

另见 数据分析;数据分布;频率;比例
see also Analysis of data; Distribution of data; Frequency; Proportion

数据库 118
Database 118

日期 13, 118, 125-6, 131
Dates 13, 118, 125- 6, 131

死亡 393
Death 393

年龄特异率 499
age- specific rate 499

每日数量 68
number per day 68

小数位数 12, 42
Decimal places 12, 42

自由度 181, 192, 209, 245, 246-7
Degrees of freedom 181, 192, 209, 245, 246- 7

分母 487, N507
Denominator 487, N507

牙釉质腐蚀 250-2, 269
Dental enamel erosion 250- 2, 269

牙齿健康 96
Dental health 96

因变量
Dependent variable

Cox 回归中的因变量 388
in Cox regression 388

线性回归中 301
in linear regression 301

逻辑回归中 352, 355
in logistic regression 352, 355

多元回归中 340, 345, 346, 351
in multiple regression 340, 345, 346, 351

派生变量 13, 108, 125, 131, 340, 429-431
Derived variable 13, 108, 125, 131, 340, 429- 31

描述数据 19-45, 152
Describing data 19- 45, 152

设计 4, 5, 6, 74-103, 402, 426, 480
Design 4, 5, 6, 74- 103, 402, 426, 480

平衡设计 80
balanced 80

设计选择 102-103
choice of 102- 3

设计中的错误 477, 482-485
errors in 477, 482- 5

设计质量 5
quality of 5

与分析的关系 5, 179-180, 426, 430
relation to analysis 5, 179- 80, 426, 430

设计结构 83-85
structure of 83- 5

设计类型 75-77
types of 75- 7

设计的缺陷 84
weaknesses in 84

检测偏倚 95
Detection bias 95

糖尿病 2, 71
Diabetes 2, 71

糖尿病患者 94, 172-173, 177-178, 300-303, 310, 475
Diabetics 94, 172- 3, 177- 8, 300- 3, 310, 475

诊断 3, 335-337, 359, 413, 414, 417, 418, 419
Diagnosis 3, 335- 7, 359, 413, 414, 417, 418, 419

另见诊断测试
see also Diagnostic test

诊断指数 355
Diagnostic index 355

诊断测试 356, 409-419, 420, 425, 483, 486, 494
Diagnostic test 356, 409- 19, 420, 425, 483, 486, 494

基于连续测量的诊断测试 413-414, 419
based on continuous measurement 413- 14, 419

患病率的影响 411-413, 418
effect of prevalence 411- 13, 418

腹泻 359
Diarrhoea 359

短波透热疗法 38
Diathermy, short wave 38

二氯乙酸盐 321
Dichloroacetate 321

二分变量 10
Dichotomous variable 10

饮食建议 6
Dietary advice 6

饮食摄入 3, 95, 96
Dietary intake 3, 95, 96

能量摄入 183-185, 188, 189-191
of energy 183- 5, 188, 189- 91

与月经周期相关的饮食摄入 189-191
and menstrual cycle 189- 91

饮食 84, 333, 361
Diets 84, 333, 361

差异 见比较
Differences see Comparisons

数字偏好 146-148
Digit preference 146- 8

地高辛 323
Digoxin 323

二硝基氯苯(DNCB) 272
Dinitrochlorobenzene (DNCB) 272

离散数据 11, 63, 66
Discrete data 11, 63, 66

判别分析 355, 358-360
Discriminant analysis 355, 358- 60

判别函数 359
Discriminant function 359

判别 355-360, 413, 414
Discrimination 355- 60, 413, 414

无分布方法 见非参数方法
Distribution- free methods see Non- parametric methods

分布
Distribution

分布假设 51, 58, 143, 171-172, 174, 189, 486
assumption about 51, 58, 143, 171- 2, 174, 189, 486

非对称分布 136
asymmetric 136

双峰分布 53
bimodal 53

二项分布 见二项分布
Binomial see Binomial distribution

数据分布 29-31, 33, 38, 51, 132-133, 173, 180, 279
of data 29- 31, 33, 38, 51, 132- 3, 173, 180, 279

经验分布 51, 420
empirical 51, 420

指数分布 385
exponential 385

F分布 见F分布
F see F distribution

频数分布 23, 29-31, 54, 133
frequency 23, 29- 31, 54, 133

对数正态分布 60-63, 143, 164, 392, 420
Lognormal 60- 3, 143, 164, 392, 420

正态分布 见正态分布
Normal see Normal distribution

泊松分布 66-68, 70-71, 145, 246
Poisson 66- 8, 70- 1, 145, 246

概率分布 50-51
probability 50- 1

样本均值的分布 154-157, 177, 181
of sample means 154- 7, 177, 181

抽样分布 见抽样分布
sampling see Sampling distribution

分布形状 38, 136, 194
shape of 38, 136, 194

偏斜(非对称)分布 36-38, 53, 59, 60-63, 136, 145, 157, 172, 185, 199-205, 221-222, 392, N510
skewed (asymmetric) 36- 8, 53, 59, 60- 3, 136, 145, 157, 172, 185, 199- 205, 221- 2, 392, N510

对称分布 36, 53, 61, 133, 154, 188, 204-205, 222, 299
symmetric 36, 53, 61, 133, 154, 188, 204- 5, 222, 299

t分布 见t分布
t see t distribution

分布尾部 36, 58, 139, 166-167, 171, 181, 255, 257, 421
tails of 36, 58, 139, 166- 7, 171, 181, 255, 257, 421

理论分布 50-71, 171, 175
theoretical 50- 71, 171, 175

均匀分布 71, 120, 146
Uniform 71, 120, 146

单峰分布 53, 154
unimodal 53, 154

另见正态分布
see also Normal distribution of data

除法 N507
Division N507

双卵双胎率 288-290, 295, 296
Dizygotic twinning rate 288- 90, 295, 296

医生 451
Doctors 451

双模拟技术 450
Double dummy technique 450

唐氏综合征 94, 419
Down's syndrome 94, 419

退出者 见退出
Drop outs see Withdrawals

药物 3, 90
Drugs 3, 90

药物使用者 437
users 437

双光子吸收法 279
Dual photon absorptiometry 279

虚拟变量 339, 392
Dummy variable 339, 392

邓肯多重范围检验 211
Duncan's multiple range test 211

十二指肠溃疡 441
Duodenal ulcer 441

e N510 e* N511 超声心动图 300, 397 兴趣效应 83, 165, 166, 169, 455
e N510 e* N511 Echocardiography 300, 397 Effect of interest 83, 165, 166, 169, 455

鸡蛋 1, 436 射血分数 149 心电图 2 资格标准 见样本选择 呕吐 368 经验分布 51, 420 就业 100 依那普利酯 328 子宫内膜异位症 227 终点 见临床试验,结局指标 能量消耗 193-4, 197 摄入量 183-5, 188, 189-91 信封 88, 89, 450, 485 流行病学研究 8, 75, 91, 102, 266, 268, 297, 352, 355, 396, 478 误差条 426 分析中的错误 261, 385-7, 401-2, 426-7, 477, 482, 486-7 遗漏错误 7, 474, 482, 490-1 四舍五入 17, 71, 312 统计学见论文中的统计错误 抄录错误 122-3, 124, 125 第一类错误 169, 211, 457-9, N511 第二类错误 169, 457-9, N512 估计 153, 174-5, 176 精确度 80, 83, 175, 465, 487 不确定性 153, 154, 157, 160, 162-5, 175 另见置信区间 估计方法 160-5, 223 与假设检验 166, 174-5, 271, 397 伦理 477-8, 491 临床试验伦理 449-50, 451, 452-3 委员会 453, 454, 491, 492 排除标准 见样本选择 期望频数 243-4, 246, 247, 248, 250-1, 253, 254, 256, 404, 406 实验 74, 75-6, 102, 325 临床(试验) 440-74 设计 80-5 实验室 8, 90, 318 解释变量 见预测变量
Eggs 1, 436 Ejection fraction 149 Electrocardiograph 2 Eligibility criteria see Sample selection Emesis 368 Empirical distribution 51, 420 Employment 100 Enalaprilat 328 Endometriosis 227 Endpoint see Clinical trial, outcome measure Energy expenditure 193- 4, 197 intake 183- 5, 188, 189- 91 Envelopes 88, 89, 450, 485 Epidemiological study 8, 75, 91, 102, 266, 268, 297, 352, 355, 396, 478 Error bars 426 Errors in analysis 261, 385- 7, 401- 2, 426- 7, 477, 482, 486- 7 of omission 7, 474, 482, 490- 1 rounding 17, 71, 312 statistical see Statistical errors in papers transcription 122- 3, 124, 125 Type I 169, 211, 457- 9, N511 Type II 169, 457- 9, N512 Estimate 153, 174- 5, 176 precision of 80, 83, 175, 465, 487 uncertainty of 153, 154, 157, 160, 162- 5, 175 see also Confidence interval Estimation 160- 5, 223 and hypothesis testing 166, 174- 5, 271, 397 Ethics 477- 8, 491 of clinical trials 449- 50, 451, 452- 3 committee 453, 454, 491, 492 Exclusion criteria see Sample selection Expected frequency 243- 4, 246, 247, 248, 250- 1, 253, 254, 256, 404, 406 Experiments 74, 75- 6, 102, 325 clinical (trials) 440- 74 design of 80- 5 laboratory 8, 90, 318 Explanatory variable see Predictor variable

探索性分析 113, 121, 174, 282, 298, 338, 359 指数分布 385 暴露 91, 374 外推 6, 7, 100, 152, 317 视力 254-7 眼疲劳 77, 259-61
Exploratory analysis 113, 121, 174, 282, 298, 338, 359 Exponential distribution 385 Exposure 91, 374 Extrapolation 6, 7, 100, 152, 317 Eye sight 254- 7 strain 77, 259- 61

F 分布 197, 206, 209, N513 与 t 分布的关系 207 表格 524-7 F 检验 197, 206, 207, 211, 219-20 因子 83, 326 因子分析 360 阶乘 (!) 70, 256, N509, N513 析因设计 84 析因试验 449 失败时间 见生存时间 假阴性发现 169, 414, 418, 419 率 415 假阳性 357 发现 169, 211, 414, 418, 419 率 415 热性惊厥 97 胎儿生长 267, 425 胎动 77 胎儿 419, 423 腹部区域 319 头围 101, 331-3 头皮血 pH 422 超声 484 345, 347 纤维 287, 298 纤维蛋白原 288, 348-9 图形 见图表 文件 113, 118-19 Fisher, R. A. 253-4 Fisher 精确检验 253-7 饮用水氟化 1, 90, 500 叶酸 207, 210, 211-12 随访 随访时间 393 失访 见退出 研究 96-9
F distribution 197, 206, 209, N513 relation to t distribution 207 table of 524- 7 F test 197, 206, 207, 211, 219- 20 Factor 83, 326 Factor analysis 360 Factorial (!) 70, 256, N509, N513 Factorial design 84 Factorial trial 449 Failure time see Survival time False negative findings 169, 414, 418, 419 rate 415 False positive 357 findings 169, 211, 414, 418, 419 rate 415 Febrile seizure 97 Fetal growth 267, 425 Fetal movements 77 Fetus 419, 423 abdominal area 319 head circumference 101, 331- 3 scalp blood pH 422 ultrasound 484 345, 347 Fibre 287, 298 Fibrinogen 288, 348- 9 Figures see Graphs File 113, 118- 19 Fisher, R. A. 253- 4 Fisher's exact test 253- 7 Fluoridation of drinking water 1, 90, 500 Folate 207, 210, 211- 12 Follow up duration of 393 loss to see Withdrawals study 96- 9

用力呼气容积 345, 347 表单,数据收集 114-19, 454 设计 116-18 数据输入格式 114-16 频数 23, 229, 241, 434 累积 29-31, 54, 133 分布 23, 29-31, 54, 133 期望 243-4, 246, 247, 248, 250-1, 253, 254, 256, 404, 406 直方图 见直方图 观察值 243, 246, 248, 253, 406 多边形 27, 52-3 相对 27, 29, 51 表格 229, 241-66 多向 360 249, 250-9, 260, 266, 269, 270-1, 379 Friedman 双因素方差分析 334-6 函数 N510 功能残气量(FRC) 361
Forced expiratory volume 345, 347 Form, data collection 114- 19, 454 design 116- 18 Format for data input 114- 16 Frequency 23, 229, 241, 434 cumulative 29- 31, 54, 133 distribution 23, 29- 31, 54, 133 expected 243- 4, 246, 247, 248, 250- 1, 253, 254, 256, 404, 406 histogram see Histogram observed 243, 246, 248, 253, 406 polygon 27, 52- 3 relative 27, 29, 51 table 229, 241- 66 multi- way 360 249, 250- 9, 260, 266, 269, 270- 1, 379 Friedman's two way analysis of variance 334- 6 Function N510 Functional residual capacity (FRC) 361

胃冷冻 441 高斯,C. F. 51 高斯分布参见正态分布 婴儿性别 49, 71-72, 177 全科医疗 7 全科医生(GP)年龄-性别登记 7 转诊 99 几何平均数 22, 37-38, 62, 164, 202 妊娠期 42 与出生体重 266, 310, 326, 425 格司特酮 227 眼镜(镜片) 216, 254-257 球蛋白 288 葡萄糖 172-173, 177-178, 292, 300-303, 306-309, 310-318, 320, 323 拟合优度 174 另见具体方法 Gossett, W. S. 181 移植物抗宿主病(GvHD) 361, 395
Gastric freezing 441 Gauss, C. F. 51 Gaussian distribution see Normal distribution Gender of baby 49, 71- 2, 177 General practice 7 General practitioner (GP) age- sex register 7 referral 99 Geometric mean 22, 37- 8, 62, 164, 202 Gestational age 42 and birthweight 266, 310, 326, 425 Gestrinone 227 Glasses (spectacles) 216, 254- 7 Globulin 288 Glucose 172- 3, 177- 8, 292, 300- 3, 306- 9, 310- 18, 320, 323 Goodness- of- fit 174 see also under specific methods Gossett, W. S. 181 Graft versus host disease (GvHD) 361, 395

图表 16, 43-45, 221, 369, 397, 400, 402, 418, 426 误差条 221, 426 误导性 488-489 科学论文中 488-489 另见具体图表类型 格林伍德,M. 379 格林伍德标准误 379 组序贯试验 449 组的可比性 39, 78, 81, 88-89, 375, 442, 461-462, 464-465, 473 生长图表(标准) 425 发育迟缓 267 与体型比较 101 指南 参见统计指南
Graphs 16, 43- 5, 221, 369, 397, 400, 402, 418, 426 error bars in 221, 426 misleading 488- 9 in scientific papers 488- 9 see also specific types of graph Greenwood, M. 379 Greenwood's standard error 379 Group sequential trial 449 Groups, comparability of 39, 78, 81, 88- 9, 375, 442, 461- 2, 464- 5, 473 Growth chart (standard) 425 retardation 267 versus size 101 Guidelines see Statistical guidelines

血细胞比容 288, 348 血液透析 226 血红蛋白 90 习惯用手 2, 103, 105 估计值(如 p) N509 风险 388, 390-391 函数 388, 392 比率 375-376, 383-384, 385, 390 头围 101, 331-333 头痛 165, 214 健康 2 教育 90, 99 青少年犯罪者健康 254 健康工人效应 484 听力 75 心脏病发作 2, 357 疾病 2, 7, 51, 358 与打鼾相关 264 瓣膜性 363 心力衰竭 149, 323, 327-328 心肺移植 364 心率 223, 327-331 心脏容积 396 身高 49, 59, 146, 336, 350, 351 成年人身高 51, 71, 72, 499 儿童身高 125, 457 增长 457 直升机 334-336 肝素 226 慢性活动性肝炎 178
Haematocrit 288, 348 Haemodialysis 226 Haemoglobin 90 Handedness 2, 103, 105 Hat (e.g. p) N509 Hazard 388, 390- 1 function 388, 392 ratio 375- 6, 383- 4, 385, 390 Head circumference 101, 331- 3 Headache 165, 214 Health 2 education 90, 99 of juvenile delinquents 254 Healthy worker effect 484 Hearing 75 Heart attack 2, 357 disease 2, 7, 51, 358 and snoring 264 valvular 363 failure 149, 323, 327- 8 - lung transplantation 364 rate 223, 327- 31 volume 396 Height 49, 59, 146, 336, 350, 351 of adults 51, 71, 72, 499 of children 125, 457 gain 457 Helicopter 334- 6 Heparin 226 Hepatitis, chronic active 178

希尔,A. Bradford 478, 492 直方图 23-28, 33, 36-37, 39, 43, 51, 52, 60, 113, 133 相对频率 27, 29 历史对照 446-447, 453 HIV 血清阳性 413, 416, 419 霍奇金病 126, 150, 200-205, 225 方差齐性 143, 180, 192, 197-198, 199-201, 206, 303 激素 434 医院 6, 177 入院率 93 对照 93 记录 114, 122, 132 数量 446 HTLV-III 抗体 413 高血压 79, 132, 352-356, 413, 425 妊娠期高血压 275, 476 另见血压 假设检验 165-171, 174 备择假设 165, 168 错误,I型和II型 169, 211, 457-459 不当使用 397, 403, 461, 486 解释 167-169, 177, 222-223, 429, 485, 489 多重检验 211, 453-454, 465 原假设 165-167, 168, 170 表达 175, 176-177, 220, 221, 487 与置信区间的关系 175, 235, 240 与样本量的关系 167, 169, 455-459, 485 显著性水平 168, 344-345 双侧或单侧 170-171, 177, 214 另见 P 值;功效;统计显著性 甲状腺功能减退婴儿 198
Hill, A. Bradford 478, 492 Histogram 23- 8, 33, 36- 7, 39, 43, 51, 52, 60, 113, 133 relative frequency 27, 29 Historical controls 446- 7, 453 HIV seropositivity 413, 416, 419 Hodgkin's disease 126, 150, 200- 5, 225 Homogeneity of variance 143, 180, 192, 197- 8, 199- 201, 206, 303 Hormones 434 Hospital 6, 177 admission rates 93 controls 93 notes 114, 122, 132 numbers 446 HTLV- III antibody 413 Hypertension 79, 132, 352- 6, 413, 425 in pregnancy 275, 476 see also Blood pressure Hypothesis test 165- 71, 174 alternative hypothesis 165, 168 errors, types I and II 169, 211, 457- 9 inappropriate use 397, 403, 461, 486 interpretation 167- 9, 177, 222- 3, 429, 485, 489 multiple tests 211, 453- 4, 465 null hypothesis 165- 7, 168, 170 presentation 175, 176- 7, 220, 221, 487 relation to confidence interval 175, 235, 240 and sample size 167, 169, 455- 9, 485 significance level 168, 344- 5 two- sided or one- sided 170- 1, 177, 214 see also P value; Power; Statistical significance Hypothyroid infants 198

IgE 405, 434, 435 IgM 23-24, 27, 28, 31-33, 36, 38, 41, 51, 53, 59, 420-421, 423-425 失衡 88-89, 375, 442, 461-462, 464-465 浸没服 334-336 不精确 参见精确性 记忆不准确 94-95 纳入标准 参见样本选择 独立性 49-50, 350
IgE 405, 434, 435 IgM 23- 4, 27, 28, 31- 3, 36, 38, 41, 51, 53, 59, 420- 1, 423- 5 Imbalance 88- 9, 375, 442, 461- 2, 464- 5 Immersion suits 334- 6 Imprecision see Precision Inaccurate recall 94- 5 Inclusion criteria see Sample selection Independence 49- 50, 350

独立组 180 独立观察 230, 282, 318 自变量 参见预测变量 统计推断 5, 71, 169, 490 另见样本作为总体估计 无限 N510, N513 炎症性肠病 359 红外刺激(IRS) 233-235, 252-253 输入格式 114-116 意向治疗分析 464 观察者间变异 参见观察者 四分位距 33 评分者一致性 参见观察者 交互作用 330, 331, 351, 354, 449, 467 截距 302, 312, 315, 316, 320 国际相关 290-291, 298 插值 33, 38 结果解释 5, 8, 81, 94, 174, 177, 337, 442, 471-473, 482, 489-490 干预研究 152 肠易激综合征 504 缺血性心脏病 7, 358
Independent groups 180 Independent observations 230, 282, 318 Independent variable see Predictor variable Inference, statistical 5, 71, 169, 490 see also Sample, as estimate of population Infinity N510, N513 Inflammatory bowel disease 359 Infra- red stimulation (IRS) 233- 5, 252- 3 Input format 114- 16 Intention to treat analysis 464 Inter- observer variation see Observers Inter- quartile range 33 Inter- rater agreement see Observers Interaction 330, 331, 351, 354, 449, 467 Intercept 302, 312, 315, 316, 320 International correlation 290- 1, 298 Interpolation 33, 38 Interpretation of results 5, 8, 81, 94, 174, 177, 337, 442, 471- 3, 482, 489- 90 Intervention study 152 Irritable bowel syndrome 504 Ischaemic heart disease 7, 358

美国医学会杂志 465 期刊 参见医学期刊 青少年犯罪者 254-257
Journal of the American Medical Association 465 Journals see Medical journals Juvenile delinquents 254- 7

Kaplan-Meier 生存曲线 368-371, 377-379, 384, 385, 386, 394 Kappa 404-408 置信区间 405, 406 解释 407-409 加权 406, 407 Kendall, M. G. 286 Kendall's tau (τ) 286 肾移植 124, 145, 360 膝围 401 Kruskal-Wallis 检验 213-215, 335 有序组 215-216, 265 峰度 136
Kaplan- Meier survival curve 368- 71, 377- 9, 384, 385, 386, 394 Kappa 404- 8 confidence interval for 405, 406 interpretation 407- 9 weighted 406, 407 Kendall, M. G. 286 Kendall's tau (τ) 286 Kidney transplant 124, 145, 360 Knee circumference 401 Kruskal- Wallis test 213- 15, 335 ordered groups 215- 16, 265 Kurtosis 136

实验室实验 8, 90, 318 拟合不足 参见拟合优度
Laboratory experiment 8, 90, 318 Lack of fit see Goodness- of- fit

乳酸性酸中毒 321, 394
Lactic acidosis 321, 394

柳叶刀 465, 478, 488
Lancet 465, 478, 488

纬度 288-90, 295, 296
Latitude 288- 90, 295, 296

最小二乘法 301
Least squares method 301

左撇子 2, 103, 105
Left- handedness 2, 103, 105

左心室搏出量(SV) 397-400
Left ventricular stroke volume (SV) 397- 400

白血病 1, 3, 362
Leukaemia 1, 3, 362

测谎仪 437
Lie- detector 437

生活事件 269-70
Life events 269- 70

生命表 368, 371, 394
Life table 368, 371, 394

似然比 417
Likelihood ratio 417

一致性限度 399-400, 402
Limits of agreement 399- 400, 402

线性模拟量表 15-16, 172
Linear analogue scale 15- 16, 172

线性模型 173
Linear model 173

线性回归 126-127, 213, 219, 262, 277, 299-321, 326, 336, 337, 402, 423, 430
Linear regression 126- 7, 213, 219, 262, 277, 299- 321, 326, 336, 337, 402, 423, 430

方差分析表 297, 308, 316
analysis of variance table for 297, 308, 316

假设 303
assumptions 303

置信区间 320, 321
confidence intervals 320, 321

估计的置信区间 314-315
for estimate 314- 15

直线的置信区间 306-307, 314-315, 316, 319, 320
for line 306- 7, 314- 15, 316, 319, 320

斜率的置信区间 306, 313, 319, 320
for slope 306, 313, 319, 320

因变量 301
dependent variable 301

与相关性的区别 277, 320-1
distinction from correlation 277, 320- 1

解释变量 301, 303, 317
explanatory variable 301, 303, 317

外推 316-17
extrapolation 316- 17

拟合值 301, 312, 314
fitted values 301, 312, 314

拟合优度 302-6, 308
goodness- of- fit 302- 6, 308

假设检验 315-16, 319, 320
hypothesis tests 315- 16, 319, 320

自变量 301, 303, 317
independent variable 301, 303, 317

直线截距 302, 312, 315, 316, 320
intercept of line 302, 312, 315, 316, 320

解释 316-18
interpretation 316- 18

最小二乘法原理 301
least squares principle 301

直线 301-3, 306, 310-12
line 301- 3, 306, 310- 12

数学 310-16
mathematics 310- 16

非参数 318
non- parametric 318

结果变量 301
outcome variable 301

预测区间 307, 315, 316, 319, 320, 321
prediction interval 307, 315, 316, 319, 320, 321

预测变量 301, 303, 317
predictor variable 301, 303, 317

展示 319-20, 487-8
presentation of 319- 20, 487- 8

308
308

残差 301, 303-6, 308, 313, 423
residual 301, 303- 6, 308, 313, 423

残差标准差 308, 313, 319, 320
residual standard deviation 308, 313, 319, 320

残差方差 302, 308, 313
residual variance 302, 308, 313

响应变量 301
response variable 301

直线斜率 262, 302, 306, 309, 311-12, 315, 316, 319, 320, 430
slope of line 262, 302, 306, 309, 311- 12, 315, 316, 319, 320, 430

标准误差 306, 314-15, 319
standard error 306, 314- 15, 319

两个样本 309, 339
two samples 309, 339

变异解释 308-9, 316
variation explained 308- 9, 316

线性关系 279, 296, 300, 303
Linear relation 279, 296, 300, 303

线性变换 41
Linear transformation 41

线性趋势
Linear trend

针对分类数据 249,261-5,319,339
for categorical data 249, 261- 5, 319, 339

针对连续数据 212-13,215-16,220,318-19,339
for continuous data 212- 13, 215- 16, 220, 318- 19, 339

肝脏
Liver

活检 66
biopsy 66

扫描 409-11
scan 409- 11

移植 368
transplantation 368

另见:肝硬化;原发性胆汁性肝硬化
see also Cirrhosis; Primary biliary cirrhosis

对数线性模型 272
Log- linear model 272

对数比值 352
Log odds 352

对数 36, N510, N511
Logarithm 36, N510, N511

对数变换 36-7, 41-2, 60-2, 126, 136, 143-5, 199, 200-3, 205, 287, 303-6, 392, 400-1, N510
Logarithmic transformation 36- 7, 41- 2, 60- 2, 126, 136, 143- 5, 199, 200- 3, 205, 287, 303- 6, 392, 400- 1, N510

N510, N511
N510, N511

逻辑检查 124-5
Logical check 124- 5

逻辑回归 146, 326, 351-8, 359
Logistic regression 146, 326, 351- 8, 359

系数 354, 358
coefficient 354, 358

计算 355
computing 355

置信区间 354
confidence interval 354

因变量 352, 355
dependent variable 352, 355

用于判别 355-8, 413, 414
for discrimination 355- 8, 413, 414

自变量 352, 355
explanatory variable 352, 355

拟合优度 358
goodness- of- fit 358

预后指数 355
prognostic index 355

预后变量 355
prognostic variable 355

逐步法 355
stepwise 355

Logit 变换 145-6, 352
Logit transformation 145- 6, 352

对数正态分布 60-3, 143, 164, 392, 420
Lognormal distribution 60- 3, 143, 164, 392, 420

Logrank 检验 371-5, 379-83, 385, 386, 394
Logrank test 371- 5, 379- 83, 385, 386, 394

趋势检验 374-5, 381-2
for trend 374- 5, 381- 2

分层检验 375, 382-3
stratified 375, 382- 3

寿命 2, 103, 105
Longevity 2, 103, 105

纵向研究 76, 96-9, 466
Longitudinal study 76, 96- 9, 466

失访 见 退出
Loss to follow up see Withdrawals

腰痛 38-9
Low back pain 38- 9

下呼吸道感染 438
Lower respiratory tract infection 438

肺容量 364
Lung capacity 364

肺功能 21, 351, 466
Lung function 21, 351, 466

另见 ;PEmax;PImax
see also ; PEmax; PImax

肺移植 364
Lung transplantation 364

淋巴细胞异常 200
Lymphocyte abnormalities 200

主效应 354
Main effect 354

哺乳动物 279
Mammals 279

乳腺X线摄影 356-7
Mammography 356- 7

另见 Xeromammogram
see also Xeromammogram

Mann-Whitney(Wilcoxon)检验 194-7, 198, 213, 214, 215, 265, 467
Mann- Whitney (- Wilcoxon) test 194- 7, 198, 213, 214, 215, 265, 467

表格见 532-4
table for 532- 4

Mantel-Haenszel 方法 271
Mantel- Haenszel method 271

大麻 236-9
Marijuana 236- 9

婚姻状况 242-4, 247-8, 339
Marital status 242- 4, 247- 8, 339

乳房切除术 76, 452
Mastectomy 76, 452

匹配 94, 180, 189, 448, 491
Matching 94, 180, 189, 448, 491

数学符号 9, N505-13
Mathematical notation 9, N505- 13

矩阵
Matrix

相关系数 288, 299
correlation coefficients 288, 299

散点图 342
scatter plots 342

最大静态呼气压(PEmax)336, 339, 341, 342-345, 347, 350
Maximal static expiratory pressure (PEmax) 336, 339, 341, 342- 5, 347, 350

最大静态吸气压(PImax)21-22, 23, 28, 35, 40, 109
Maximal static inspiratory pressure (PImax) 21- 2, 23, 28, 35, 40, 109

最大似然法 355
Maximum likelihood 355

McNemar检验 240, 258-259, 266, 416
McNemar test 240, 258- 9, 266, 416

均值 21, 33, 37, 38, 41, 171, 221, 222, 423-424, 430, N506
Mean 21, 33, 37, 38, 41, 171, 221, 222, 423- 4, 430, N506

置信区间 162-4, 181, 209-10, 221, 222, 329
confidence intervals for 162- 4, 181, 209- 10, 221, 222, 329

单一样本 183-4
single sample 183- 4

两配对样本 190, 201-2
two paired samples 190, 201- 2

两非配对(独立)样本 192-3
two unpaired (independent) samples 192- 3

几何均值 22, 37-8, 62, 164, 202
geometric 22, 37- 8, 62, 164, 202

假设检验
hypothesis test for

多个样本 206-9
several samples 206- 9

单一样本 184-5
single sample 184- 5

两个配对样本 191
two paired samples 191

两个非配对(独立)样本 194
two unpaired (independent) samples 194

展示 42, 487, 488, 489
presentation of 42, 487, 488, 489

抽样分布 153-7, 177, 181
sampling distribution of 153- 7, 177, 181

均方 209
square 209

标准误差 参见 标准误差
standard error of see Standard error

媒体报道 1, 2, 3, 90
Media reporting 1, 2, 3, 90

中位数 22, 33, 37, 38, 42, 164, 173, 221, 487
Median 22, 33, 37, 38, 42, 164, 173, 221, 487

置信区间 185, 194
confidence interval for 185, 194

医学期刊 vii, 3, 4, 102, 465, 472, 477, 479-91, 492-3, 498
Medical journals vii, 3, 4, 102, 465, 472, 477, 479- 91, 492- 3, 498

政策 42, 168, 175, 387, 478, 489-90, 492, 493
policy 42, 168, 175, 387, 478, 489- 90, 492, 493

作用 493
role 493

另见 科学论文
see also Papers, scientific

初潮年龄 287, 298
Menarche, age of 287, 298

绝经 89, 105
Menopause 89, 105

月经周期 189-91, 434
Menstrual cycle 189- 91, 434

Meta 分析 270-1, 472-3
Meta- analysis 270- 1, 472- 3

代谢率 322, 333-4, 361
Metabolic rate 322, 333- 4, 361

方法比较研究 277, 284, 396-403, 484, 486
Method comparison studies 277, 284, 396- 403, 484, 486

偏倚 398, 402
bias 398, 402

设计 402
design of 402

错误分析 284
erroneous analyses 284

样本量 402
sample size 402

偏头痛 83, 214
Migraine 83, 214

牛奶消费 288-90, 296, 457
Milk consumption 288- 90, 296, 457

最小化 91, 443-5
Minimization 91, 443- 5

流产 72, 91, 95
Miscarriage 72, 91, 95

缺失数据/值 109, 113, 115-16, 118, 123, 124, 130-2, 149, 326, 426, 429, 431, 433, 463, 484, 485
Missing data/value 109, 113, 115- 16, 118, 123, 124, 130- 2, 149, 326, 426, 429, 431, 433, 463, 484, 485

众数 22
Mode 22

模型 171, 174, 317, 340, 349-350, 352, 356-357, 430
Model 171, 174, 317, 340, 349- 50, 352, 356- 7, 430

参见回归
see also Regression

矩 136
Moments 136

出生月份 434
Month of birth 434

月亮 68
Moon 68

围产期死亡率 14, 19
Mortality, perinatal 14, 19

晕动症 368, 372, 375, 379, 380, 384, 385, 394
Motion sickness 368, 372, 375, 379, 380, 384, 385, 394

交通事故 101
Motoring accidents 101

高山病 503
Mountain sickness 503

多重RAST(MAST) 405-407, 409
Multi- RAST (MAST) 405- 7, 409

多维频数表 360
Multi- way frequency table 360

多中心研究 89, 375, 443, 455, 460
Multicentre study 89, 375, 443, 455, 460

多重比较 210-12, 215, 248, 261, 329, 336, 387, 465, 491
Multiple comparisons 210- 12, 215, 248, 261, 329, 336, 387, 465, 491

多重相关系数 346
Multiple correlation coefficient 346

多重计数 466, 486
Multiple counting 466, 486

多重逻辑回归 见逻辑回归
Multiple logistic regression see Logistic regression

多元回归 309-10, 326, 333, 334, 336-51, 414, 465, 467
Multiple regression 309- 10, 326, 333, 334, 336- 51, 414, 465, 467

调整后的 345-6, 351
adjusted 345- 6, 351

全子集 345, 349
all subsets 345, 349

方差分析表 343
analysis of variance table for 343

假设条件 350-1
assumptions 350- 1

回归系数 336-7
coefficients 336- 7

置信区间 336, 351
confidence interval 336, 351

常数项 336
constant 336

因变量 340, 345, 346, 351
dependent variable 340, 345, 346, 351

自变量 336-7, 339-46, 348, 349-51
explanatory variable 336- 7, 339- 46, 348, 349- 51

拟合值 347-8, 351
fitted values 347- 8, 351

拟合优度 345-6, 349, 351
goodness- of- fit 345- 6, 349, 351

解释 337
interpretation 337

模型 336, 340-7
model 336, 340- 7

过拟合 341, 345
overfitting 341, 345

预测区间 351
prediction interval 351

预测变量 336-7, 339-46, 348, 349-51
predictor variable 336- 7, 339- 46, 348, 349- 51

表达 351, 489
presentation 351, 489

预后指数 337, 347-8
prognostic index 337, 347- 8

预后变量 336-7, 339-46, 348, 349-51
prognostic variable 336- 7, 339- 46, 348, 349- 51

345-6, 350, 351
345- 6, 350, 351

与方差分析的关系 325-6
relation to analysis of variance 325- 6

与偏相关的关系 348-9
relation to partial correlation 348- 9

残差 341, 344, 346-7, 349, 351
residuals 341, 344, 346- 7, 349, 351

残差标准差 346, 351
residual standard deviation 346, 351

样本量 349
sample size 349

标准误 336, 347
standard error 336, 347

逐步法 340-5, 349-50
stepwise 340- 5, 349- 50

解释的变异 340, 347
variation explained 340, 347

Multiple testing 338, 349, 371, 429, 433, 453-4, 465, 486
Multiplication N506
Mustine 444
Myocardial infarction 94, 96, 97, 161, 386, 451, 464, 466, 474
N N510, N512
Nausea 225
Negative predictive value 410-13, 415-16, 419
Negative result 170, 489
Neural tube defects 446, 447
New England Journal of Medicine 102, 465, 479, 480
Newman-Keuls test 211
Nicardipine 467, 469
Nicotine chewing gum 459
Nitrous oxide 207
Nodes, positive 89, 375, 382, 462
Noise, loud 276
Nominal data 11
Nomogram 456-60
Non-linear transformation 41
Non-linear trend 213, 296-7, 310, 350
Non-Normal distribution 136-42, 185, 306
Non-parametric methods 51, 130, 133, 145, 171-3, 180, 189, 199, 207, 223
  for categorical data 265, 266, 271
  for continuous data 199, 203-5, 213-16, 365
  for correlation 279, 285-8
  for cyclic data 434
  for reference range 423
  for regression 318
  for survival data 371-5, 379-83
  see also specific methods
Non response 98, 100
Non-significant see Statistical significance
Nonsteroidal antiinflammatory drugs (NSAID) 465
Normal approximation 173, 196, 197
  to Binomial distribution 66, 155, 157, 161, 230, 231, 239, 370, 459
  to Chi squared distribution 244-6, 258
  to Poisson distribution 66, 246

Normal distribution 51-60, 71
  in analysis of variance 330, 334
  of data 132, 172, 174, 180, 182, 199, 206, 223, 279, 330, 359, 457, 490
  random samples from 60, 120, 153, 155-7
  and reference range 420-1, 423, 425
  in regression 303-6, 318, 346
  as sampling distribution 154, 162, 165, 166-7, 173, 177, 181, 231, 233, 267
  standard 54, 244, 258, 421
  relation to t distribution 181
  tables of 515-20
  testing for 133-43, 166-7, 207, 279, 291-2, 303-6, 346, 421
  transformation to 61-3, 133-6, 143-5, 199, 207, 421, N510
Normal plot 133-43, 149, 207, 279, 291-2, 303, 346, 347, 421
Normal probability paper 142
Normal range see Reference range
Normal score 54, 142-3, 292, 353, 425
Normality, test of 133-43, 166-7, 207, 279, 303-6, 346, 421
Notation 9, N505-13
Nuclear power installation 1
Null hypothesis 165-8, 170
Number of subjects see Sample size
Numerator 487, N507
Numerical precision 17, 18, 70-1, 487-8
Obesity 193, 194, 413
  and hypertension 352-5
Objective of research 120, 180-1, 430, 454
Observation see Data
Observational study (survey) 6, 74, 75-6, 91-3, 99-101, 102-3, 152, 325, 336, 370, 426, 467
Observers
  comparison of 397, 401, 403-9
  variation among 402-3, 419, 438
Occupation 19
Ocular exposure 216
Odds 268, 352, 417
  log 352
  post-test 417
  pre-test 417

Odds ratio 146, 268-71, 352, 354
  for combining tables 271
  confidence interval 269-70
  in logistic regression 352-4
  paired samples 269-70
  relation to relative risk 271
  two independent samples 268-9
Omission of information 7, 474, 482, 490-1
One-sided (one-tailed) test 170-1, 177, 214
Open study 441
Opinion poll 1
Oral contraceptive 1, 50, 95, 105
Ordinal (ordered categorical) data 11, 180, 229, 249, 261-5, 272, 319, 409, 434, 486
Osteopathy 38
Outcome measure see Clinical trial, outcome measure
Outliers 113, 126-30, 139, 145, 149, 342, 346
Over-fitting 341, 345
Over-optimism 349, 351, 442
Overview 270-1, 472-3
Ovulation 360
Oxygen 442

P value 166, 175, 271, 344, 345, 465, 473, 485, 498, N511, N512
  adjusted 221
  comparison of 467, 486
  exact 168, 171, 176, 196, 253
  interpretation 167-9, 177, 222-3, 429, 485, 489
  limitations 174-5
  over-reliance 169-70
  presentation 175, 176-7, 220, 221, 487
  two-sided or one-sided 170-1, 177, 214
  see also Hypothesis test; Statistical significance
Packed cell volume (PCV) 288, 348
Pain 16, 38-9, 233-5, 441
Paired data 180, 486, 490
  see also Comparison of groups

Papers, scientific 4, 7, 454, 455, 477-99
  assessing 177, 473-4, 480, 481-92, 494
  checklists for 494-7
  confidence intervals in 177, 486, 487, 489-90, 498
  errors in see Statistical errors in papers
  figures in 488-9
  Methods section 7, 454
  presentation of results 18, 176-7, 220-2, 471
  reading 493-7
  refereeing of 478, 492, 493, 494
  reviews of 455, 465-6, 473, 479-91
  statistical guidelines for 494, 498
  statistical methods used in 174, 479-80
  writing 471, 473, 498-9
  see also Medical journals; Presentation of statistical material
Parallel groups 447
Parameter 51, 54, 171, 181
Parametric methods 51, 58, 143, 145, 171-2, 173, 180, 181, 189, 223, 288, 421, 423, 425
  see also specific methods
Partial correlation 288-91, 296, 297, 348-9
  relation to multiple regression 348-9
Participation rate see Response rate
Passive smoking 91
PBC see Primary biliary cirrhosis
PCV see Packed cell volume
Peak expiratory flow rate 82, 397
Peak value 430
Pearson, K. 278
Pearson's product moment correlation coefficient (r) see Correlation coefficient
Pediatrics 479
PEmax 336, 339, 341, 342-5, 347, 350
Percentage 13, 271, 487
Percentile see Centile
Perinatal mortality 14, 19
Phenobarbitone 470
pH 145, 422
Piglets 142

Pilot study 460
PImax 21-2, 23, 28, 35, 40, 109
Placebo 450-1, 452
Placebo effect 451
Planning a study 75, 483, 484
Plethysmography 364
Pleural effusions 444
Plotting data see Graphs
Poisson distribution 66-8, 70-1, 145, 246
Polygraph 437
Polynomial regression 310
Population 5, 50
  see also Sample
Pork 297
Positive predictive value 410-13, 415-16, 419
Positive result 170, 489
Posterior probability 416
Power 169, 170, 261, 450, 465, 469
  and sample size 169, 170, 189, 393, 455-60, 474, 484-5
Powers N507-8
Pre-eclamptic toxaemia 476
Precision
  of data 487
  of estimates 80, 83, 465, 487
  numerical 17, 18, 70-1, 487-8
  spurious 42, 121, 487-8
Prediction 277, 340, 410
  see also Regression
Prediction interval 307, 315, 316, 319, 320, 321
Predictor variable
  in Cox regression 388, 389, 392
  in linear regression 301, 303, 317
  in logistic regression 352, 355
  in multiple regression 336-7, 339-46, 348, 349-51
Prednisolone 178, 273-4
Pregnancy 95, 101, 124, 423, 484
  blood pressure in 426
  hypertension in 275, 476
  uric acid in 39
  weight gain in 6
Presentation of statistical material 3, 494
  analysis of variance 220-2
  Chi squared test 271
  comparison of proportions 271
  confidence intervals 176-7
  correlation 299
  data 18, 42-5, 221, 271, 433
  hypothesis tests 175, 176-7, 220, 221, 487
  mean 42, 487, 488, 489
  misleading 453-4, 487-9, 491
  regression 319-20, 351, 487-8, 489
  results 18, 176-7, 220-2
  standard deviation 42, 222, 487, 488, 490
  standard error 221-2, 487, 488, 490
  survival analysis 393-4
  t test 220-2
  see also Graphs

Prevalence 407, 411, 412, 413-14, 415-16, 418-19, 423
Priests 101
Primary biliary cirrhosis (PBC) 52, 54, 60-1, 136, 148, 155, 157, 160, 164, 166, 389, 391, 392, 393
Prior probability 416
Prison 101
Probability 48-50, 53, 57, 69, 257-8, 354, 356, 417
  conditional 368
  density 53
  distribution 50-1
  in hypothesis tests see P value
  prior 416
  posterior 416
Product (Π) N509
Product moment correlation coefficient see Correlation coefficient
Progesterone 426, 431, 433
Prognosis 389, 391, 413, 414, 442, 451, 465
Prognostic index 337, 347-8, 390-1, 414
  see also Risk score
Prognostic variable 91, 375, 382, 448, 461-2, 473
  see also Predictor variable
Proportion 66, 154-5, 157, 229, 352, 358, 434, N506
  several samples 241, 259-65, 319
  single sample 230-2
  standard error of see Standard error
  transformation of 145-6, 352
  two paired samples 235-41, 258-9, 269-70
  two unpaired (independent) samples 232-5, 250-8, 259-65, 266-9
    equivalence to Chi squared test 257-8, 259, 271
  see also Logistic regression; Survival time data, survival proportion
Proportional hazards regression 387-93
Prospective study 74, 76, 266
Protocol 146, 454-5, 463-4, 466, 485
Pruritus 470-1, 474
Pseudo-longitudinal study 76, 101
Pseudo-random allocation 446, 485, 494
Psychiatric patients 101
Publication bias 169-70, 472-3, 483
Pulmonary tuberculosis 440, 478
Quadratic curve 310, 317, 319, 424
Qualitative data see Categorical data
Quality control 149
Quantile 133
  see also Centile
Quantitative data see Continuous data
Quartile 34
Questionnaire 98, 100
r see Correlation coefficient
R² 308, 345-6, 350, 351
  adjusted 345-6, 351
Radioallergosorbent test (RAST) 405-7, 409
Radiologists 403-4

Random
  allocation 85-90, 442-3, 482, 491
  numbers 86, 120, 285
    computer generation of 86, 120
    table of 540-4
  sample 6, 60, 82, 153, 155-7, 279, 283
  variation 19, 78, 316, 329, 472
Randomization 79, 80, 81-2
  block 87-8, 89
  cluster 90
  restricted 87
  simple 86-7, 89
  stratified 88-9
  weighted 87
  see also Clinical trial, randomization
Range 31, 221
  central 33, 57
  checking 124, 149
  inter-quartile 33
  normal/reference see Reference interval
Rank 13, 22, 33, 173, 185, 205, 213-14, 286, 365
Rank correlation 265, 279, 285-8, 295-6, 297
Rank methods see Non-parametric methods
Rank sum test see Wilcoxon test
Rate 14
Ratio 14, 202
Raynaud's phenomenon 467
Recall bias 94-5
Receiver operating characteristic (ROC) curve 417-18
Reciprocal N508
  transformation 141-5
Rectal biopsy 359
Red cell volume 435
Refereeing 478, 492, 493, 494
Reference interval 76, 419-26
  confidence interval for 422-3, 425
  relation to age 423-6
  sample selection 420, 426
  sample size 420, 422, 425
  using empirical (per)centiles 420, 421-3, 425
  using Normal distribution 420-3
  using transformation 420-1
Reference range see Reference interval
Regression
  distinction from correlation 277, 320-1
  linear see Linear regression
  logistic see Logistic regression
  multiple see Multiple regression
  polynomial 310
  presentation of 319-20, 351, 487-8, 489
  for survival data see Cox regression
  to the mean 285

Relating a part to the whole 285
Relative frequency 27, 29, 51
Relative risk 266-8, 271
Relaxation response 214
Reliability of results 4, 81
Renal failure 127, 226
Repeatability 401, 402, 419
Repeated measurements 80, 327-31, 434
Replicate observations 82, 217, 331-3, 401
Representative sample 6, 75, 78, 82, 100, 153, 160, 368, 451, 490
Reproducibility 331-3
Research
  controversies in 3, 5, 96, 102
  steps in 5
  validity of findings 4
Residual
  in analysis of variance 207, 328, 330, 334
  in linear regression 301, 303-6, 308, 313, 423
  in multiple regression 341, 344, 346-7, 349, 351
Residual plot 303-6, 346-7, 351
Residual standard deviation 209, 221, 308, 313, 319, 320, 346, 351
Residual variance 207, 302, 308, 313, 329, 330
Residual variation 328
Respiratory
  muscle strength 22, 336
  rate 438
  tract infection 438
Response rate 7, 490
Response to treatment 387
Response variable 301
Resting metabolic rate (RMR) 322, 361
Results
  contradictory 3, 5, 96, 102
  checking 114
  interpretation of 5, 8, 81, 94, 174, 177, 337, 442, 471-3, 482, 489-90
  presentation of 18, 176-7, 220-2, 271, 393-4
Retrolental fibroplasia 441-2
Retrospective study 74, 76, 91, 484

Reviews of the literature 455, 465-6, 473, 479-91
Rheological studies 485
Rheumatoid arthritis 45, 274-5, 436, 465
Rifampicine 470
Risk 354
  see also Relative risk
Risk factors 95
  see also Prognostic factors
Risk ratio see Relative risk
Risk score 357
  see also Prognostic index
Road accident casualties 24-6, 101
Robustness 197
ROC curve 417-18
Rounding 17, 18, 71, 312
Royal Statistical Society 478-9
Running a mile 317
Runts 142
Salmonella 1
Salt 223
Sample 5, 223
  biased 7
  as estimate of population 5, 6, 8, 34, 50, 152, 157, 160, 451, 471
  random 6, 60, 82, 153, 155-7, 279, 283
  representative 6, 75, 78, 82, 100, 153, 160, 368, 451, 490
Sample selection 6, 82-3, 425, 426
  see also Clinical trial, eligibility criteria

Sample size
  for clinical trial 443, 447, 452, 454, 455-60, 464, 474, 484
  for comparing proportions 187, 189, 248, 253
  for correlation 298
  for discriminant analysis 359
  and hypothesis test 167, 169, 455-9, 485
  inadequate 170, 455, 484
  in multiple regression 349
  as part of design 6, 83, 443, 455-60, 464, 485
  and power 169, 170, 189, 393, 455-60, 474, 484-5
  and reference interval 420, 422, 425
  and sampling distribution 143, 154, 161, 181, 248, 253, 378
  in survival analysis 376, 378, 386, 393
  unequal 459-60
Sampling distribution 153-9, 162, 171, 173, 177, 197, 230, 233, 354
  and sample size 143, 154, 161, 181, 248, 253, 378
Sampling unit 431, 466

Sampling variation 136, 155, 167, 170, 422
Scatter plot/diagram 40, 43, 113, 125, 133, 279, 299, 319, 342, 397, 469, 488, 489
Scheffé, H. 211
Schizophrenia 103
Scientific papers see Papers, scientific
Scores 14, 15, 172, 413
  of ordered groups 215-16, 262, 264-5, 318, 339
Screening 414, 418-19
  for breast cancer 366-7
  for cervical cancer 95
SD see Standard deviation
SE see Standard error
Seasonal variation 71, 148, 434
Seat belts 101, 162, 165
SEE 320
Selection of subjects see Sample selection
Selection of variables 337, 340-5, 359, 389
SEM 160
Sensitivity 410, 412, 413, 415-16, 418, 419
Sequential trial 448-9, 455
Serial measurements 125, 331, 426-33, 434, 489
  area under the curve (AUC) 430, 431-3
  graphical display 426, 431
  interpretation 433
  problems with usual approach 426-7, 433
  summary measures 429-31, 433
Sex of baby 49, 71-2, 177
Shapiro-Francia test 139, 291, 303-6, 330
  table for 538-9
Shapiro-Wilk test 139, 166, 279, 291
Shoe size 229, 261-5, 319
Siblings, sex ratio of 273
Sign test 186-7, 204, 205, 240, 336
Significance level 168, 344-5
Significance test see Hypothesis test
Significant see Statistical significance
Significant figures (digits) 320
Simulation 120, 155, 157
Singing voices 273
Sinusoidal (sine) curve 434-5
Size of sample see Sample size
Size versus growth 101
Skewness 36-8, 136
  see also Distribution, skewed
Skin problems 474
Sleeping difficulties 236-9
Slope of regression line see Linear regression
Smoking 2, 7, 95, 96, 357, 459
  and cancer 93, 275
  and hypertension 352-5
  passive 91
  and tooth eruption 31
  and urinary cotinine excretion 226
  see also Cigarettes
Snoring
  and heart disease 264
  and hypertension 352-5
Social class 7
Sodium aurothiomalate (SA) 45, 274-5
Software see Computer program
Space
  deconditioning 223
  shuttle 223
Spearman, C. 286
Spearman's rank correlation coefficient 286-8, 295-6, 297
  table for 530
Specificity 410, 412, 413, 415-16, 418, 419, 420
Spectacles (glasses) 216, 254-7

Sphygmomanometer 79, 148
Spreadsheet 112
Square N507
Square root N508
Square root transformation 41, 143, 145, 202
Stage of disease 12, 96, 172, 375
Standard deviation (SD) 33-6, 37, 38, 41, 42, 153, 154, 155, 171, 181, 398, 422-4, 457, N506
  presentation of 42, 222, 487, 488, 490
  similar in different study groups 143, 180, 192, 197-8, 199-201, 206
  see also Variance
Standard deviation score 425
Standard error (SE)
  of centile 422
  of difference between means 160-1, 190, 192
  of difference between proportions 162, 233, 234, 237, 240
  of estimate (SEE) 320
  presentation of 221-2, 487, 488, 490
  in regression 306, 314-15, 319, 336, 347
  of sample mean 154, 160, 162, 221-2
  of sample proportion 161-2, 165, 230
  of survival proportion 370, 378
  use in constructing confidence intervals 165, 235
Standard Normal deviate see Normal score
Standardized difference 457, 458, 459
Standardized mortality ratio 14
Statistic N506
Statistical analysis see Analysis of data
Statistical errors in papers 4, 261, 477-8, 479, 491, 498
  in analysis 261, 385-7, 401-2, 426-7, 477, 482, 486-7
  consequences of 453, 482, 491-2
  in design 473-4, 477-8, 482-5
  in execution 485
  in interpretation 482, 489-90
  of omission 7, 474, 482, 490-1
  in presentation 453-4, 487-9, 491
  reasons for 492

Statistical guidelines
  for authors 494, 498
  clinical trials 494
  epidemiological studies 494
Statistical inference see Inference
Statistical modelling 171, 173-4, 317
  see also Model; Regression
Statistical significance 168-71, 174, 455-6, 472, 489
  and clinical importance 170, 177
  see also Hypothesis test; P value
Statistician 75, 102, 145, 282, 493
Statistics
  at large 1-3
  in medical research 4-8, 174, 478-81
  in medicine 3-4
  mistrust of 3
  misuse of vii, 481, 486
    see also Statistical errors in papers
  reviews of use of 455, 465-6, 473, 479-91
  scope of 5-8
  teaching of vii
Stem-and-leaf diagram 28
Step function 369, 386
Stepwise regression 340-5, 349-50, 355, 389
Streptococcus, group B 224
Streptomycin 440, 478
Stressful life events 269-70
Stroke 451
Stroke volume (SV) 397-400
Stuart-Maxwell test 266
Student 181
Student's t test see t test
Study design see Design
Subgroup analyses 466-7, 472, 473, 486
Subjects see Sample selection
Subscript N505-6
Subset analyses see Subgroup analyses
Suicide 91
Sulphoxidation index (SI) 45, 228, 274-5
Sum of squares 35, 208, 218-20, 311
Summary measures/statistics 429-31, 433, 487, 488
  see also Data description
Summation (Σ) 35, N508

Sunglasses 216
Superscript N507
Surgery 450
  breast cancer 365, 375
  cardiac bypass 207
Surveillance bias 99
Survey see Observational study
Survival time data 16, 125, 365-94
  actuarial method 371
  comparison of groups 371-6, 379-85, 386-7
  computing 366, 370, 375, 377, 379, 389, 391
  confidence intervals 369-70, 371, 376, 378-9, 383-5, 391
  Cox regression 387-93
  design considerations 393
  graphical presentation 369, 376, 385, 386-7, 394
  hazard 388, 390-1
  hazard function 388, 392
  hazard ratio 375-6, 383-4, 385, 390
  incorrect analyses 385-7
  Kaplan-Meier survival curve 368-71, 377-9, 384, 385, 386, 394
  life table 368, 371, 394
  logrank test 371-5, 379-83, 385, 386, 394
  mean survival time 386
  median survival time 369, 376, 384-5, 386
  presentation of results 393-4
  prognostic index 390-1
  proportional hazards regression 387-93
  response to treatment 387
  sample size 376, 378, 386, 455
  survival proportion/probability 366, 367-71, 376, 377-9, 384, 386-7, 388, 390-1
Sweeteners 102
Swimmers 250-2, 269
Systematic allocation 446, 485, 494

t distribution 165, 166, 181-2, 184, 219, 294, 296, 316, 340, N512
  compared to Normal distribution 181
  table of 521-2
t test
  after analysis of variance 211, 219-20
  modified 211
  one sample 184-5, 191, 219, 397, 469
  paired 191, 199, 202, 221, 326, 328, 397, 402
  presentation of 220-2
  two sample (unpaired) 192, 194, 197, 198, 199, 205, 206, 207, 209, 221, 319, 333, 426, 467, 469
cells 126, 143, 200-5, 225
cells 200-5
Tables 42, 488
  frequency see Frequency table
  statistical 514-45
Tails of distribution 36, 58, 139, 166-7, 171, 181, 255, 257, 421
Talc 444
Tea 102
Teeth
  age at eruption 31
  decayed, missing and filled 500
  number at one year 350
Temperature 41
Test statistic 166, 167, 221, 487
Testosterone 273
Theoretical distributions 50-71, 171, 175
Therapeutic trial see Clinical trial
Thyroxine 198
Ties (tied ranks) 173, 197, 265, 295, 334, 335
Time
  change over 101-2
  hidden effect of 148-9
  of peak 430
  -related data 283, 426-33
  series methods 360
  trend 148, 283, 488
Time to run a mile 317
Titre 145
Total lung capacity (TLC) 364
Transcription errors 122-3, 124, 125
Transformation N510
  of data 36, 41-2, 108, 143, 180, 199, 204, 279, 303, 346, 350, 392, 420

  linear 41
  logarithmic 36-7, 41-2, 60-2, 126, 136, 143-5, 199, 200-3, 205, 287, 303-6, 392, 400-1, N510
  logit 145-6, 352
  non-linear 41
  to Normality 61-3, 133-6, 143-5, 199, 207, 421, N510
  of proportion 145-6, 352
  rationale for 143-5
  reciprocal 143-5
  square root 41, 143, 145, 202
Transmitral volumetric flow (MF) 397-400
Transplantation
  bone marrow 361, 395
  heart-lung 364
  kidney 124, 145, 360
  liver 368
  lung 364
Trapezium rule 433
Treatment allocation 86, 88, 442-7, 450, 461, 485
Treatment-period interaction 448, 467, 469
Trend see Linear trend; Non-linear trend
True positive 357
Trypsin 212-13, 219-20
Tumour
  response 387
  size 89, 145, 396
Twins 288-90, 295
Two-sided (two-tailed) test 167, 170-1, 177, 214
Type I error 169, 211, 457-9, N511
Type II error 169, 457-9, N512
  see also Power

Ulcerative colitis 359
Ultrasound 82, 101, 267, 331, 484
Ultraviolet radiation 216
Uncertainty 3, 48, 145, 153, 157, 169, 307, 321, 421
Uncontrolled trial 441, 478
Underwater helicopter escape 334-6
Unemployment 103

Uniform distribution 71, 120, 146
Unit of investigation 431, 466
Units of measurement 41, 122
Unmarried mothers 102
Unpaired test see test, two sample
Upper respiratory tract infection 438
Urea nitrogen 161
Uric acid 39
Urinary cotinine excretion 226
Urine flow 323

Valvular heart disease 363
Variability 17, 19, 42, 51, 206, 221
between subjects 189, 401
description of 22-31, 398, 419, 487, 489
explained 297, 308-9, 316, 340, 347
quantification of 31-8
sources of 78
within subjects 189, 206
Variable 17, 108, N505-6
selection 337, 340-5, 359, 389
Variance 34, 154, 192
analysis of see Analysis of variance
pooled 192
ratio 197-8
similar in different study groups 143, 180, 192, 197-8, 199-201, 206
see also Standard deviation
Vegetarian 94
Ventilation 207
Vision defects 254-7
Visual analogue scale (VAS) 15-16, 172
Visual display units (terminals) (VDU) 72, 77, 91, 95, 259-61
Vitamin supplementation 446-7, 453
Volunteer bias 100, 446, 484
Vomiting 368, 372-3, 376, 377-8

W test 139, 166, 279, 291
W test 139, 291, 303-6, 330
table for 538-9
Wash-out period 448, 469, 471
Water
chlorinated 250, 269
fluoridated 1, 90, 500
Wedge pressure 149
Weight 12, 59, 84, 279, 298, 336, 343-4, 345, 347, 423
birth see Birthweight
fetal 279
gain during pregnancy 6
maternal 279
Welch test 198
White cell counts 90
Wilcoxon test
one sample (signed rank sum) 187-9
table for 531
two paired samples 191, 203-5, 266, 336
effect of transformation of data 203-5
table for 531
two unpaired (independent) samples see Mann-Whitney test
Withdrawals 132, 366, 447, 463, 471, 473, 490
Within group (paired) comparison 448
Working history 95

22, N509, N511

Xeromammogram 403-4
Yates' correction 252-3, 260
z test 167, 171, 198
z value see Normal score
Zelen's design 449
Zero-one variable see Binary variable

169, N511
169, N512
54, N511
N509, N511
35, 54, N511
35, N508, N511
see Chi squared test
36, 221, 488, N509
! (factorial) 70, 256, N509, N513
N510, N513